Sox genes in grass carp (Ctenopharyngodon idella) with their implications for genome duplication and evolution

The Sox gene family is found in a broad range of animal taxa and encodes important gene regulatory proteins involved in a variety of developmental processes. We have obtained clones representing the HMG boxes of twelve Sox genes from grass carp (Ctenopharyngodon idella), one of the four major domestic carps in China. The cloned Sox genes belong to groups B1, B2 and C. Our analyses show that whereas the human genome contains a single copy of Sox4, Sox11 and Sox14, each of these genes has two co-orthologs in grass carp, and that the duplication of Sox4 and Sox11 occurred before the divergence of grass carp and zebrafish, which supports the "fish-specific whole-genome duplication" theory. A molecular clock estimate using the Sox1, Sox3 and Sox11 genes as markers indicates that grass carp (subfamily Leuciscinae) and zebrafish (subfamily Danioninae) diverged approximately 60 million years ago. The potential uses of Sox genes as markers in revealing the evolutionary history of grass carp are discussed.


INTRODUCTION
The Sox (SRY-related genes containing an HMG box; HMG, high mobility group) gene family was first identified in 1990 as a group of genes related to the mammalian testis determining factor Sry based on conservation of the single HMG box, which encodes a 79-amino acid DNA-binding HMG domain [10]. The number of known Sox genes has been expanded through homology-based screening approaches recently [4,7,42]. The roles of some Sox proteins have been revealed as important developmental regulators in a variety of developmental processes, e.g. during sex determination and the development of heart and CNS (central nervous system) [2,35].
For all Sox proteins, the HMG domains, outside of which Sox sequences are highly variable, are highly conserved in primary structure, and all appear to be capable of binding to the same target DNA sequence [39]. Specificity in target selection is considered to be brought about via a combinatorial mechanism involving interaction with other tissue-specific transcription factors and spatio-temporal expression patterns [15,16,40]. A total of 20 Sox genes has now been identified in the mouse and human by whole-genome sequence analyses [35], and primary sequence comparison and other structural indicators such as intron-exon organization indicate that these genes fall into eight clear groups, A-H [2].
Sox genes have been identified in a broad range of animal taxa, including mammals, birds, reptiles, amphibians, fish, insects and nematodes. Among vertebrates, orthologs in different species are highly similar to each other. Most of these groups are represented by a single gene in the invertebrate model organisms Drosophila melanogaster and Caenorhabditis elegans, suggesting that each single gene expanded into multiple related genes during vertebrate evolution [2]. Since the divergence of regulatory genes is considered necessary to bring about phenotypic variation and an increase in biological complexity, it has been proposed that such duplication events have been of major importance for evolution in vertebrates [25,30,31]. It is not yet clear whether this expansion is the result of numerous independent gene duplications or of the two rounds (three rounds in fish) of genome duplication during the evolution of vertebrates [38]. Analyses of additional vertebrate genomes are required to refine the understanding of the molecular evolution of this gene family.
However, the organization and function of the Sox gene family are far less well understood in other vertebrates than in mammals. Surprisingly, there have been few studies on Sox genes in teleost fishes, which comprise half of the vertebrate species [3,9,17,44] and whose developmental mechanisms may be related to, but markedly different from, those employed by mammals [8]. In the present study, we cloned 12 members of the Sox gene family in grass carp (Ctenopharyngodon idella), one of the most important herbivorous fishes in the world and a species of particular significance in fisheries and aquaculture in China. Phylogenetic analyses and their implications for fish genome duplication and evolution were also explored. The major aim of this study was to extend our understanding of the organization and evolution of important regulatory genes, such as the Sox transcription factor gene family, in a teleost fish of economic importance.

Amplification, cloning and sequencing of Sox genes
Sox genes were amplified by PCR using a pair of degenerate primers designated as SoxX [9]. The SoxX primers (ATGAAYGCNTTYATGGTNTGG and GGNCGRTAYTTRTARTCNGG) correspond to the motifs MNAFMVW and PDYKYRP, which are found in the HMG boxes of almost all Sox proteins that belong to groups B and C. As template in PCR amplification, genomic DNA was extracted from fresh fin tissues of grass carp using the traditional phenol-chloroform method. The PCR reactions contained 50 pmol of each primer, 0.1 µg genomic DNA, 200 µM dNTPs, 1.5 mM MgCl2 and 2.5 units of Taq DNA polymerase in a 25 µL reaction mix. PCR amplifications were performed for 35 successive cycles of 95 °C for 1 min, 54 °C for 1 min, and 72 °C for 1 min. The PCR products were resolved on 1.5% agarose gels, and the expected bands were excised and gel-purified. Purified products were then subcloned into the vector pMD-18-T. The positive clones were sequenced on an ABI 3730 capillary sequencer.
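The correspondence between the SoxX primers and the two HMG-box motifs can be checked with a short script. The following is a minimal sketch, not part of the original protocol: the function names are hypothetical, the standard genetic code is assumed, and only the IUPAC codes that occur in these primers (R, Y, N) are handled. The reverse primer is reverse-complemented before translation; its last two bases (CC) cover only part of the final proline codon, so the translation stops at PDYKYR.

```python
from itertools import product

# IUPAC codes used in the SoxX primers and the bases each stands for.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "N": "ACGT"}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A",
              "R": "Y", "Y": "R", "N": "N"}

# Standard genetic code in the conventional TCAG order (an assumption here).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def reverse_complement(seq):
    """Reverse-complement a sequence that may contain R, Y or N."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def degenerate_translate(seq):
    """Translate codon by codon; each entry lists every amino acid
    that the (possibly degenerate) codon can encode."""
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    residues = []
    for start in range(0, usable, 3):
        codon = seq[start:start + 3]
        encoded = {CODON_TABLE["".join(combo)]
                   for combo in product(*(IUPAC[base] for base in codon))}
        residues.append("".join(sorted(encoded)))
    return residues


def degeneracy(seq):
    """Number of distinct oligonucleotides in the degenerate primer pool."""
    total = 1
    for base in seq:
        total *= len(IUPAC[base])
    return total


forward = "ATGAAYGCNTTYATGGTNTGG"
reverse = "GGNCGRTAYTTRTARTCNGG"

forward_motif = "".join(degenerate_translate(forward))
reverse_motif = "".join(degenerate_translate(reverse_complement(reverse)))
print(forward_motif, degeneracy(forward))
print(reverse_motif, degeneracy(reverse))
```

Each degenerate codon in both primers resolves to a single amino acid, which is why the primer pools stay small (64-fold and 256-fold degenerate) while still covering every codon choice for the two motifs.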

Sequence analysis and phylogenetic construction
Sequence identities were assigned following combined analyses including BlastX searches of GenBank and the human genome sequences, searches of signature residues from putative amino acid sequences [17] and phylogenetic clustering analyses using the software MEGA3.1 [18]. The relationships between the proteins encoded by the grass carp Sox genes and the corresponding proteins in humans (and zebrafish in the case of Sox21) were analyzed using the Minimum-Evolution method [34] with the Dayhoff Matrix Model [5]. To estimate the relative age of the paralogs of duplicate Sox genes in grass carp, a linearized tree was constructed from third-codon position substitutions in the nucleotide sequences of Sox4 and Sox11, together with Sox1, Sox2 and Sox3, using the Minimum-Evolution method with the p-distance model and 1000 bootstrap replicates. To estimate the divergence time between grass carp and zebrafish, a linearized tree of Sox1, Sox3 and Sox11 was constructed in the same way, using third-codon position substitutions, the Minimum-Evolution method with the p-distance model, and 1000 bootstrap replicates. Another linearized tree, of the concatenated dataset of Sox1a, Sox3 and Sox11a, was built from synonymous substitutions by the UPGMA method [36] with the modified Nei-Gojobori (p-distance) model and a transition/transversion ratio of 2.
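The p-distance at third codon positions described above is simply the proportion of differing sites among the third bases of each codon. The sketch below illustrates the calculation only; it is not the MEGA3.1 implementation. The function names and the two toy sequences are hypothetical, the sequences are assumed to be aligned and in frame, and gapped sites are skipped pairwise (an assumption).

```python
def third_codon_positions(cds):
    """Return every third base of an in-frame coding sequence."""
    return cds[2::3]


def p_distance(seq_a, seq_b):
    """Proportion of compared sites at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Skip any site where either sequence has a gap (assumed handling).
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


# Hypothetical toy sequences, not real Sox data.
toy_a = "ATGAACGCCTTC"
toy_b = "ATGAATGCGTTC"

thirds_a = third_codon_positions(toy_a)  # "GCCC"
thirds_b = third_codon_positions(toy_b)  # "GTGC"
distance = p_distance(thirds_a, thirds_b)
print(distance)  # 2 differences over 4 sites -> 0.5
```

A matrix of such pairwise distances is the input from which a distance-based method such as Minimum-Evolution or UPGMA builds the tree.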

Assignment of orthology and nomenclature of the grass carp Sox genes
Electrophoresis of PCR products generated from genomic DNA using the SoxX primers showed a single band at a size of about 200 bp as expected for intron-free Sox genes. This band was gel-purified and subcloned, and 36 individual clones were sequenced. Twelve different Sox genes (Sox1, Sox2, Sox3, Sox4a, Sox4b, Sox11a, Sox11b, Sox12, Sox14a, Sox14b, Sox21a, Sox21b) were obtained in grass carp (GenBank accession numbers DQ642604-DQ642615), and the alignment results of these genes are shown in Figure 1.
Each of these genes was represented in at least two independent clones except for Sox11b and Sox21b, and the number of the variable sites among these Sox sequences is too large to result from PCR errors, making it unlikely that our data would be affected by PCR artifacts. Since our combined analyses led to unambiguous gene assignment, we refer to the grass carp Sox genes by their nomenclature proposed in this paper.
The relationships between the proteins encoded by grass carp Sox genes and the corresponding proteins from human or zebrafish were analyzed, and the phylogenetic tree is shown in Figure 2. The clustering of grass carp and human sequences provides strong support for the gene assignments given to the grass carp sequences. All cloned Sox genes have mammalian orthologs. The human genome contains a single copy of Sox4, Sox11, Sox14 and Sox21, whereas the grass carp holds two copies of the corresponding genes.

Mapping duplication events onto phylogeny
Phylogenetic analysis shows that at least three of the ancestral vertebrate Sox genes (Sox4, Sox11, Sox14) are duplicated in grass carp (Fig. 2). Duplicates of Sox4 and Sox11 have also been identified in the zebrafish [6,26]. To estimate the age of the paralogs, the number of nucleotide substitutions at third codon positions was used to construct a linearized tree from the data of Sox4, Sox11, Sox1, Sox2 and Sox3 in grass carp and zebrafish. Since most third-codon position substitutions do not result in amino-acid replacements, the rate of fixation of these substitutions is expected to be relatively constant in different protein-coding genes (e.g. [29]) and to reflect the overall mutation rate [14]. The linearized tree (Fig. 3), with high bootstrap values, clearly demonstrated that the duplications of Sox4 and Sox11 occurred before the divergence of grass carp (subfamily Leuciscinae) and zebrafish (subfamily Danioninae), which belong to different subfamilies of the Cyprinidae in the order Cypriniformes, and that the two duplications likely fall within the same time range. Although two copies of Sox1 were reported in zebrafish, only one ortholog was cloned in grass carp in the present study. This may result from inadequate sampling and sequencing of the clones or, less probably, from loss of the other copy of the Sox1 gene in grass carp after the divergence of these two species.

Dating the divergence of grass carp and zebrafish
The number of synonymous substitutions per synonymous site can be used to estimate divergence times based on the molecular clock model [28,29]. We constructed a linearized tree based on third codon positions of Sox1, Sox3 and Sox11 of grass carp, zebrafish, Takifugu, sea bass and Amphilophus citrinellum. It shows fine phylogenetic resolution (Fig. 4), consistent with the previously reported phylogeny of these fish species. Furthermore, the subtrees representing Sox1, Sox3 and Sox11 show nearly the same pattern.
With a larger combined dataset, improved phylogenetic resolution is expected [20]. The datasets were concatenated on the assumption that the observed divergence of Sox genes corresponds to speciation events. Due to limited species availability, we combined sequences from different species belonging to the same larger fish taxon. Thus Sox1a and Sox3 of sea bass were combined with Sox11a of Amphilophus citrinellum (Fig. 5), since both species belong to the Perciformes. The date of divergence of zebrafish and Takifugu, set at the root of the tree, was taken as ∼290 (±6) Mya (million years ago) from a published calibration based on molecular data [19]. The result shows that the divergence time between grass carp and zebrafish is ∼63 (±2) Mya.
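On a linearized tree, node heights are proportional to time, so a single calibration point converts any node height into a date. The sketch below shows this scaling under a strict-clock assumption. The function name and the node heights are hypothetical, since the tree heights themselves are not reported here; only the 290 Mya root calibration is taken from the text. Read in reverse, the reported dates imply that the grass carp-zebrafish node lies at roughly 63/290 ≈ 0.22 of the root height.

```python
def date_node(node_height, calibration_height, calibration_time_mya):
    """Convert a node height on a linearized tree into a date in Mya,
    assuming height is strictly proportional to elapsed time."""
    if node_height < 0 or calibration_height <= 0:
        raise ValueError("heights must be non-negative and the calibration positive")
    # Substitutions per site per million years implied by the calibration.
    rate = calibration_height / calibration_time_mya
    return node_height / rate


ROOT_TIME_MYA = 290.0            # zebrafish-Takifugu split, from the text [19]
hypothetical_root_height = 0.50  # illustrative value, not from the paper
hypothetical_node_height = 0.10  # illustrative value, not from the paper

estimate = date_node(hypothetical_node_height,
                     hypothetical_root_height,
                     ROOT_TIME_MYA)
print(round(estimate, 1))

# Relative node height implied by the reported dates (63 of 290 Mya).
implied_ratio = 63.0 / ROOT_TIME_MYA
print(round(implied_ratio, 3))
```

Because the estimate scales linearly with the calibration date, any error in the 290 Mya calibration carries over proportionally to the inferred divergence time.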

New evidence for the "fish-specific whole-genome duplication" theory
In total, 12 grass carp Sox genes including members of the SoxB1, SoxB2 and SoxC groups [2] have been cloned and identified (Fig. 1). This is the first report describing Sox genes in the subfamily Leuciscinae and the data presented constitute one of the most complete analyses of the Sox gene family in fish to date.
The sequences described in this study are well conserved at the amino acid level and all the cloned Sox genes have mammalian orthologs. Several clones obtained in this study represent genes that are duplicated in grass carp with respect to the mammalian Sox gene family. Gene duplication appears to be very common in fish, although in many situations it has not yet been determined whether this is due to numerous independent segmental duplications or an ancestral whole-genome duplication [38].
Whereas the human genome contains a single copy of Sox4, Sox11 and Sox14, each of these genes has two co-orthologs in grass carp (Fig. 2). The clustering of grass carp and human sequences illustrates the gene duplications that have occurred in grass carp, although nearly all duplicate genes showed an "outgroup" topology instead of the "sister-gene" topology expected for duplicate genes. Duplicate genes occasionally show such an "outgroup" topology when one copy, released from its previous selective pressure, evolves at an accelerated rate. Evolutionary rates of two paralogs often differ enormously; usually one of the paralogs evolves considerably faster than the other [37]. Exhaustive searches for orthologs in other fish in the literature and GenBank indicate that some species also have duplicates of certain Sox genes, notably Sox1, Sox4, Sox11 and Sox21 in zebrafish, Sox1, Sox4 and Sox14 in sea bass (Perciformes) [9], and Sox1 and Sox14 in Takifugu (Tetraodontiformes) [17]. Since these genes are distributed on different chromosomes in all genomes characterized so far [35], this suggests that a large-scale or whole-genome duplication event affecting the Sox genes occurred relatively early during teleost evolution. Gene silencing and subsequent loss can happen within a short time after a gene duplication event [21,22,23,24], so the failure to detect duplicates for other Sox genes, e.g. Sox3, may be explained by gene loss after duplication, considering the generally high rate of loss of duplicate gene copies in vertebrates (∼67% over ∼500 million years [27]) and the time of the assumed fish-specific genome duplication event, which has been dated to 335-404 Mya [13].
If the fish-specific genome duplication before the teleost radiation [13,38] is a real historical event, and the duplication of certain Sox genes in fish is the consequence of this event, then the divergence of the two copies of the duplicated Sox genes must have occurred before the divergence of grass carp and zebrafish, which both belong to Cyprinidae of the Cypriniformes. Our analyses using the number of nucleotide substitutions at third codon positions (Fig. 3) showed that the duplication of Sox4 and Sox11 took place before the divergence of grass carp and zebrafish, therefore further supporting the "fish-specific whole-genome duplication" theory.
The present data unexpectedly showed an abnormal tree topology with regard to gc-Sox21b. It seems that gc-Sox21b is the ortholog of neither hu-Sox21 nor zf-Sox21b, but the co-ortholog of zf-Sox21a (Fig. 1). The number of variable sites between the obtained nucleotide sequences of gc-Sox21a and gc-Sox21b is only 6, far fewer than between other pairs of duplicated Sox genes, e.g. 35 between gc-Sox14a and gc-Sox14b. The gene gc-Sox21b must therefore be relatively young, and is probably the consequence of a much more recent segmental duplication rather than of the ancient fish-specific whole-genome duplication event. Furthermore, half of the nucleotide differences between gc-Sox21a and gc-Sox21b are nonsynonymous, whereas almost all substitutions between other paralogs of duplicate Sox genes are synonymous. This indicates that the additional copy of Sox21, gc-Sox21b, is still in the period of relaxed selection experienced by most duplicated genes in their early history [24], again suggesting its recent origin.
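The comparison of synonymous and nonsynonymous differences between paralogs can be illustrated by classifying each differing codon according to whether the encoded amino acid changes. This is a deliberately simplified, codon-level sketch, not the modified Nei-Gojobori procedure: it counts differing codons rather than individual substitutions, it assumes the standard genetic code and an in-frame alignment, and the function name and toy sequences are hypothetical.

```python
# Standard genetic code in the conventional TCAG order (an assumption here).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def classify_codon_differences(seq_a, seq_b):
    """Count differing codons that keep (synonymous) or change
    (nonsynonymous) the encoded amino acid."""
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned, in frame, and of equal length")
    synonymous = 0
    nonsynonymous = 0
    for start in range(0, len(seq_a), 3):
        codon_a = seq_a[start:start + 3]
        codon_b = seq_b[start:start + 3]
        if codon_a == codon_b:
            continue
        # Codons containing gaps or ambiguous bases are skipped.
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1
        else:
            nonsynonymous += 1
    return synonymous, nonsynonymous


# Hypothetical toy sequences, not real Sox data.
toy_a = "ATGAACGCCTTCATG"  # M N A F M
toy_b = "ATGAATGCGTTAATG"  # M N A L M
counts = classify_codon_differences(toy_a, toy_b)
print(counts)  # (2, 1): two silent codon changes, one F -> L replacement
```

A pair of paralogs under purifying selection would be expected to show mostly synonymous differences, whereas a roughly even split, as reported for gc-Sox21a and gc-Sox21b, is what relaxed selection would tend to produce.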

Duplicates of Sox: meaning for subfunction partitioning
After gene or genome duplication, each gene copy may follow a separate evolutionary trajectory. New gene duplicates face one of three fates: non-functionalization, neo-functionalization or subfunctionalization [31,32]. Since most of the nucleotide differences between duplicates of gc-Sox genes are synonymous substitutions that do not alter the encoded amino acid, the sequences appear to be under considerable selective pressure, which strongly suggests that both copies are functional and makes it unlikely that the duplicates are pseudogenes.
While the origin of a new function appears to be a very rare fate for a duplicate gene [24], subfunction partitioning is relatively common among duplicated genes arising from the fish-specific genome duplication event [31]. Previous studies have shown that Sox11a and Sox11b are expressed in both overlapping and distinct sites, and may thus share between them the developmental domains of the single Sox11 gene present in mice and chickens [6].
The partitioning of ancestral subfunctions between gene copies arising from the whole-genome duplication could have contributed to the speciation and radiation of teleost fish. Beyond its importance for understanding mechanisms generating biodiversity, the partitioning of subfunctions between teleost co-orthologs of human genes can facilitate the identification of tissue-specific conserved noncoding regions, and can simplify the analysis of ancestral gene functions obscured by pleiotropy or haploinsufficiency [1,31,41]. It is known that the functions of some Sox genes in fish are related to those in humans in key developmental processes such as CNS development and cartilage formation [6,43]. Further studies on the partitioning of subfunctions between the Sox gene co-orthologs in grass carp will be worthwhile. In any case, the Sox genes presented in this study are good candidates for studies on the structure, function and evolution of genes and genomes in vertebrates.

Estimation for the origin of grass carp from the Sox molecular clock
The grass carp is one of the four major domestic carps in China, the so-called "Chinese carps", whose origin is still unclear; historical information based on calibration from molecular data of nuclear genes is therefore valuable. We constructed linearized trees based on the molecular clock theory using the nuclear genes Sox1, Sox3 and Sox11 as markers. These genes are located on different chromosomes in all of the genomes characterized so far and thus represent independent, unlinked loci. Our result (Figs. 4 and 5) indicates that grass carp (subfamily Leuciscinae) and zebrafish (subfamily Danioninae) shared a last common ancestor approximately 63 (±2) Mya, which falls in the Paleocene (57-65 Mya). This result agrees with the fossil record of Leuciscinae in the Oligocene (23-35 Mya) [33] and with the previous notion that the Leuciscinae originated from ancient species of the subfamily Danioninae in the early Paleogene [11]. The speciation of grass carp is known to be related to the uplift of the Tibetan plateau, which began in the Neogene (1.6-23 Mya) [12]. Further studies on Sox genes of other cyprinid species more closely related to grass carp, e.g. silver carp (Hypophthalmichthys molitrix), will be informative in elucidating the evolutionary history of grass carp more accurately.