Main

Lancelets, or amphioxus, are small worm-like marine animals that spend most of their lives buried in the sea floor, filter-feeding through jawless, ciliated mouths. The vertebrate affinities of these modest creatures were first noted in the early part of the nineteenth century1,2, and were further clarified by the embryologist Alexander Kowalevsky3. In particular, Kowalevsky observed that, unlike other invertebrates, amphioxus shares key anatomical and developmental features with vertebrates and tunicates (also known as urochordates). These include a hollow dorsal neural tube, a notochord, a perforated pharyngeal region, a segmented body musculature (embryologically derived from somites) and a post-anal tail. Together, the vertebrates, urochordates and lancelets (also known as cephalochordates) constitute the phylum Chordata, descended from a last common ancestor that lived perhaps 550 million years ago.

Although Kowalevsky, Darwin and others recognized the evolutionary relationship between chordate groups, the greater morphological, physiological and neural complexity of vertebrates posed a puzzle: how did the chordate ancestor—presumably a simple creature that resembled a modern amphioxus or ascidian larva—make such a transition?

Perhaps the most prevalent hypothesis for the origins of vertebrate complexity is founded on the ideas of Susumu Ohno (1970)4, who proposed that vertebrate genomes were shaped by a series of ancient genome-wide duplications. In Ohno’s original proposal, lancelet and vertebrate genomes were enlarged relative to the basic invertebrate complement by one or two rounds of genome doubling, although subsequent work suggested that these events occurred on the vertebrate stem after divergence of the lancelet lineage5,6.

Although the sequencing of the human and other vertebrate genomes has shown that the gene number in vertebrates is comparable to, or only modestly greater than, that of invertebrates7,8, evidence for large-scale segmental or whole-genome duplications on the vertebrate stem has mounted, with the parallel realization that most gene duplicates from such events are rapidly lost (reviewed in ref. 9). The relatively few surviving gene duplicates from the vertebrate stem provide evidence for ancient paralogous relationships between groups of human chromosomes10,11,12,13,14 that plausibly arose from multiple rounds of whole-genome duplication before the emergence of modern vertebrates. However, the number, the timing and even the genomic scale of the duplication events, and their consequences for subsequent genome evolution, are poorly understood (for a review, see ref. 15), in part because the tunicate genomes are highly rearranged relative to the unduplicated early chordate karyotype (see below).

The Florida lancelet B. floridae (the generic name Branchiostoma refers to the characteristic perforated branchial arches) provides a critical point of reference for these studies16. This species and its relatives (collectively also known as amphioxus, derived from the Greek amphi+oxys, ‘sharp at both ends’) are widely regarded as living proxies for the chordate ancestor, in part owing to the general similarity of the modern amphioxus to putative fossil chordates from the early Cambrian Chengjiang fauna (Yunnanozoon lividum17, and the similar Haikouella lanceolatum18) and the middle Cambrian Burgess Shale (Pikaia gracilens19), although controversy remains (see, for example, refs 20,21). The study of key developmental genes in amphioxus has shed light on the evolution of such vertebrate organs as the brain, kidney, pancreas and pituitary, and of the genetic mechanisms of early embryonic patterning in general (reviewed by refs 22–24). Amphioxus has also served as a genomic surrogate for the proto-vertebrate ancestor in studies of the Hox cluster25, in studies of specific genomic regions26,27,28, and as an outgroup in numerous gene family studies (reviewed in ref. 29).

Here we report the draft genome sequence of the Florida lancelet and compare its structure with the genomes of other animals. Robust phylogenetic analysis of gene sequences and exon–intron structures confirms recent proposals that tunicates are the sister group to vertebrates, with lancelets as the most basal chordate subphylum, and that the combined echinoderm–hemichordate clade is sister to chordates. Through a comparative analysis, we identify 17 ancestral chordate linkage groups that are conserved in the modern amphioxus and vertebrate genomes despite over half a billion years of independent evolution. Over 90% of the human genome is encompassed within these linkage groups, which display a tell-tale fourfold redundancy that is consistent with whole-genome quadruplication on the vertebrate stem. Comparison with sequences from the sea squirt, lamprey, elephant shark and several bony fish constrains the timing of the whole-genome events to after the divergence of vertebrates from tunicates and lancelets, but before the split between cartilaginous and bony vertebrates. Within the resolution of our analysis, we find evidence for rounds of genome duplication both before and after the split between jawless vertebrates (for example, lamprey) and jawed vertebrates, although a period of octoploidy encompassing the divergence of jawless and jawed vertebrates remains a possibility. Although most duplicate genes from these whole-genome events have been lost, a disproportionate number of genes involved in developmental processes are retained.

Genome sequence

We sequenced the 520 megabase (Mb) amphioxus genome using a whole-genome shotgun strategy30 from approximately 11.5-fold redundant paired-end sequence coverage produced from random-sheared libraries with a range of insert sizes (Supplementary Note 2). Genomic DNA was prepared from the gonads of a single gravid male collected from Tampa Bay, Florida in July 2003, and exhibited extensive allelic variation (3.7% single nucleotide polymorphism, plus 6.8% polymorphic insertion/deletion; Supplementary Note 4). This is the highest level of sequence variation reported in any individual organism, exceeding that found in the purple sea urchin31. Assembly version 1 reports both haplotypes separately, whereas in assembly version 2 a single haplotype is selected at each locus (Supplementary Fig. 63). Assembly version 2 spans 522 Mb, with half of this sequence in 62 scaffolds longer than 2.6 Mb.
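
The contiguity figure quoted above (half of the assembled sequence in 62 scaffolds longer than 2.6 Mb) is the kind of summary usually called N50. The sketch below shows how such a statistic can be computed from a list of scaffold lengths. The function name `n50` and the toy lengths are illustrative assumptions, not the assembly pipeline or data used for this genome.

```python
def n50(lengths):
    """Return (N50 length, number of scaffolds needed to cover half the total).

    Scaffolds are visited from longest to shortest; the N50 is the length of
    the scaffold at which the running sum first reaches half the total length.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("empty list of scaffold lengths")


# Hypothetical scaffold lengths in Mb (not the real amphioxus assembly).
toy_lengths = [8, 6, 5, 3, 2, 1, 1]
print(n50(toy_lengths))  # (6, 2): two scaffolds of >= 6 Mb hold half the sequence
```

Applied to the version 2 assembly, the same calculation would be expected to return values corresponding to the 2.6 Mb and 62 scaffolds reported in the text.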

Currently there are no physical or genetic maps of amphioxus, so we could not reconstruct the genome as its 19 pairs of chromosomes32. Nevertheless, because half of the predicted genes are contained in scaffolds containing 138 or more genes, the current draft assembly is sufficiently long-range to permit useful analysis of conserved synteny with other species, as shown below. Comparison of the assembled sequence with open reading frames derived from expressed sequence tags (ESTs, see below) shows that the assembly captures more than 95% of the known protein-coding content, and comparison to finished clone sequences demonstrates the base-level and long-range accuracy of the assembly (Supplementary Note 2).

Protein-coding genes and transposable elements

We estimate that the haploid amphioxus genome contains 21,900 protein-coding loci. This gene complement was modelled with standard methods tuned for amphioxus, integrating homology and ab initio gene prediction methods with more than 480,000 ESTs derived from a variety of developmental stages24 (Supplementary Note 3). Approximately two-thirds of the protein-coding loci (15,123) are captured in both haplotypes. Transposable elements constitute 30% of the amphioxus genome assembly (Supplementary Table 5) and belong to more than 500 families. On the basis of their bulk contribution to the genome size, DNA transposons are twice as abundant as retrotransposons.

Polymorphism

The distribution of observed local heterozygosity over short length scales obeys a geometric distribution (Supplementary Fig. 1), consistent with the prediction of the random mating model, as observed in Ciona savignyi33, with a population mutation rate 4μNe = 0.0562, where μ is the per generation mutation rate and Ne is the effective population size. High heterozygosity can, in principle, be explained by: (1) a large effective population size maintained over many generations, (2) a high mutation rate per generation, or (3) the recent mixing of previously isolated populations; in the last case, a geometric distribution of local heterozygosity would not be expected. Assuming a typical metazoan mutation rate on the order of one to ten substitutions per gigabase (Gb) per generation34,35, this observed heterozygosity between alleles suggests a large but plausible effective breeding population on the order of millions of individuals. The observed heterozygosity shows correlations at short distances that decay on scales greater than 1 kb, indicating extensive recombination in the population (Supplementary Fig. 2). An analysis of the ratio Ka/Ks of non-synonymous to synonymous substitutions shows evidence of purifying selection comparable to that found between mammalian species (Supplementary Note 4). Insertion/deletion polymorphisms are common, as found in other intra- and inter-genome comparisons36 (Supplementary Fig. 3). Structural variation between haplotypes also includes local inversions and tandem duplications (Supplementary Fig. 63).
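
The step from the population mutation rate to an effective population size "on the order of millions" is simple arithmetic: Ne = (4μNe) / (4μ). The sketch below works it through for the two ends of the mutation-rate range quoted above, converting substitutions per gigabase to a per-site rate. The function name is an illustrative assumption.

```python
THETA = 0.0562  # population mutation rate 4*mu*Ne, as reported in the text


def effective_population_size(theta, mu_per_site):
    """Solve theta = 4 * mu * Ne for Ne."""
    return theta / (4 * mu_per_site)


# One to ten substitutions per Gb per generation = 1e-9 to 1e-8 per site.
for mu in (1e-9, 1e-8):
    ne = effective_population_size(THETA, mu)
    print(f"mu = {mu:g} per site per generation -> Ne of roughly {ne:,.0f}")
```

Under these assumed rates, Ne falls between roughly 1.4 million and 14 million individuals, in line with the estimate given in the text.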

Deuterostome relationships

With the draft amphioxus sequence in hand, we reconsidered the relationships within chordates and between deuterostome phyla (chordates, echinoderms and hemichordates). The traditional placement of lancelets as sister to vertebrates, with tunicates as the earliest diverging chordate subphylum, has recently been questioned37,38. A preliminary study39 using 146 gene loci (33,600 aligned amino acid positions) and trace data from the present amphioxus genome project found support for tunicates as the sister to vertebrates. This analysis, however, also suggested (albeit with limited statistical support) that amphioxus is more closely related to echinoderms than to tunicates or vertebrates, which would render chordates a paraphyletic group. A second study with a similarly sized set of genes but more diverse deuterostome taxa supported the early branching of the cephalochordate lineage, but not the close relationship of amphioxus and echinoderms40.

To address the controversial phylogenetic position of amphioxus, we analysed a much larger set of 1,090 orthologous genes (see Supplementary Note 5). Both bayesian and maximum likelihood methods support the new chordate phylogeny38,39,40 in which cephalochordates represent the most basal extant chordate lineage, with tunicates (represented by both Ciona intestinalis and Oikopleura dioica in our analysis) sister to vertebrates but with long branches that indicate higher levels of amino acid substitution (Fig. 1). Individual gene trees also lend support to this topology; genes supporting tunicates as a sister group to vertebrates outnumber those with amphioxus in this position by a 2:1 ratio. An analysis of intron gain and loss in deuterostomes provides independent support for amphioxus as the basal extant chordate subphylum (see below). We group echinoderms (that is, the purple sea urchin) and hemichordates (that is, the acorn worm) together (ambulacrarians) as sister to a monophyletic chordate clade, as in ref. 41 but in contrast to the suggestion of ref. 39. With the exception of the long-branched tunicates, the maximum likelihood tree suggests a roughly constant evolutionary rate of peptide change across the deuterostome tree, although an excess of substitutions is found in the vertebrates relative to the predictions of a simple molecular clock model.

Figure 1: Deuterostome phylogeny.

Bayesian phylogenetic tree of deuterostome relationships with branch length proportional to the number of expected substitutions per amino acid position, using a concatenated alignment of 1,090 genes. The scale bar represents 0.05 expected substitutions per site in the aligned regions. Long branches for sea squirt and larvacean indicate high levels of amino acid substitution. This tree topology was observed in 100% of sampled trees (see Supplementary Note 5). Numbers in red indicate bootstrap support under maximum likelihood. Unlabelled nodes were constrained.

Intron evolution

To assess the evolution of gene structure within the deuterostomes and chordates, we compared the position and phase of amphioxus introns to those in other animals. Amphioxus and human (along with other vertebrates) share a large fraction of their introns (85% in alignable regions), which match precisely in both position and phase (Supplementary Note 6), as was also found in the sea anemone Nematostella vectensis42. We found that the intron-rich gene structures of the eumetazoan ancestor were carried forward to the common chordate ancestor with relatively few gains or losses. The tunicates C. intestinalis and Oikopleura dioica, however, share many fewer introns with vertebrates43 or amphioxus.
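
As a rough illustration of what "matching in both position and phase" means, the sketch below represents each intron as a pair (column in a protein alignment, phase 0 to 2) and counts those present in both species. The data layout, the function name and the choice of denominator are assumptions made for illustration; the procedure actually used is described in Supplementary Note 6.

```python
def shared_intron_fraction(introns_a, introns_b):
    """Fraction of introns in species A that occur at the same alignment
    column and in the same phase in species B."""
    a, b = set(introns_a), set(introns_b)
    if not a:
        return 0.0
    return len(a & b) / len(a)


# Hypothetical introns as (alignment column, phase); not real gene models.
species_a = [(120, 0), (305, 2), (411, 1), (598, 0)]
species_b = [(120, 0), (305, 2), (411, 2), (700, 1)]

# The intron at column 411 does not count: the position matches, the phase does not.
print(shared_intron_fraction(species_a, species_b))  # 0.5
```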

Notably, intron presence or absence carries a significant (as measured by bootstrap values) phylogenetic signal, and bayesian analysis of the associated character matrix supports the sister relationship between tunicates and vertebrates (Supplementary Fig. 8; see Methods). This is evidently due to shared gain or loss of introns along the stem group leading to their common ancestor, which remarkably is still detectable despite additional extensive secondary losses, and modest gains, in the tunicate lineages. Thus, intron dynamics provide independent support for the new chordate phylogeny.

Chordate gene families and novelties

Through comparison of the amphioxus gene set with those of other animals, we identified 8,437 chordate gene families with members in amphioxus and other chordates that each nominally represent the modern descendants of a single gene in the last common chordate ancestor (Supplementary Note 7). That ancestor certainly possessed more genes than this number, but the others are inaccessible to us now owing to subsequent sequence divergence and/or gene loss in the living chordates. Through subsequent gene family expansions (by means of both local and/or genome-wide duplications), these families account for 13,610 amphioxus genes, 13,401 human genes and 7,216 C. intestinalis genes. The markedly lower number of descendant genes in C. intestinalis is largely due to gene loss44, with the present analysis identifying 2,251 ancient chordate genes missing in this genome sequence. We found 8 apparent chordate stem gene losses (that is, genes found in sea urchin and at least one of fly and sea anemone, but not in vertebrates, amphioxus or C. intestinalis). A list of these genes can be found in Supplementary Table 10.

We identified 239 apparent chordate gene novelties, that is, gene families represented in amphioxus and at least one vertebrate or Ciona, but without an obvious direct counterpart in non-chordate genomes. These can be characterized42 as 137 families with no detectable sequence similarity to non-chordate genes (type I novelties), 10 containing one or more chordate-specific domains linked to pre-existing metazoan domains (type II novelties), and 92 with chordate-specific combinations of pre-existing metazoan domains (type III novelties; see Supplementary Note 7). These gene families and others of special interest to vertebrate biology are discussed in a separate paper45.

Amphioxus–vertebrate synteny

We have found extensive conservation of gene linkage on the scale of whole chromosomes (macro-synteny) between the amphioxus genome and those of vertebrates (represented in our analysis by human, chicken and teleost fish), but only limited conservation of local gene order (micro-synteny). Through comparative analysis of these conserved features, we reconstructed the gene complements of 17 linkage groups (that is, proto-chromosomes) of the last common chordate ancestor. When vertebrate genomes are analysed in the light of these putative ancestral chordate chromosomes, a clear pattern of global fourfold conserved macro-synteny is found, demonstrating that two rounds of whole-genome duplication occurred on the vertebrate stem.

Reconstruction of chordate linkage groups

To identify ancestral chordate linkage groups, we first noted that many individual amphioxus scaffolds show conserved syntenic associations with human chromosomes, reflecting conserved linkage between the two genomes (Fig. 2; see also Oxford Grid in Supplementary File 5; for simplicity, we emphasize the amphioxus–human comparison in the main text, and include similar results for chicken, stickleback and pufferfish as Supplementary Information). Ninety-six scaffolds (out of 129 that possess 20 or more independent vertebrate orthologues) have a significant (P < 0.05, multiple test corrected) concentration of orthologues on one or more human chromosomes. In contrast, only 12 C. intestinalis scaffolds (out of 134 that contain 20 or more vertebrate orthologues) show significant synteny to human chromosomes.
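
One common way to test whether a scaffold's orthologues are concentrated on a particular chromosome is a hypergeometric tail probability followed by a multiple-testing correction. The sketch below uses that approach with a Bonferroni correction purely as an illustration. The text does not state which test or correction was applied here, so both choices, the function names and all of the counts are assumptions.

```python
from math import comb


def hypergeom_sf(k, total, on_chrom, drawn):
    """P(X >= k): probability that at least k of `drawn` orthologues fall on a
    chromosome holding `on_chrom` of the `total` orthologues genome-wide."""
    upper = min(on_chrom, drawn)
    hits = sum(comb(on_chrom, i) * comb(total - on_chrom, drawn - i)
               for i in range(k, upper + 1))
    return hits / comb(total, drawn)


def bonferroni(p_value, n_tests):
    """Conservative correction: scale the P value by the number of tests."""
    return min(1.0, p_value * n_tests)


# Hypothetical counts: 10,000 orthologues genome-wide, 500 of them on one
# chromosome; a scaffold carries 30 orthologues, 12 of which hit that chromosome.
p_raw = hypergeom_sf(12, 10_000, 500, 30)
p_adj = bonferroni(p_raw, n_tests=23)  # assumed: one test per human chromosome
print(p_raw, p_adj, p_adj < 0.05)
```

With roughly 1.5 hits expected by chance in this toy case, 12 hits give a corrected P value far below 0.05, which is the kind of signal the scaffold counts above summarize.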

Figure 2: Amphioxus–human synteny.

Four amphioxus scaffolds from the non-redundant version 2 assembly with synteny to human chromosome (Chr.) 17. Note that orthologous genes from these scaffolds are concentrated in specific regions of the chromosome, and that several scaffolds (for example, 18 and 162, or 149 and 207) have a high density of hits to the same segments of the chromosome, which enables a partitioning of the human genome into 135 ancient segments. Supplementary File 5 contains an Oxford grid tabulating the number of orthologues for each scaffold-segment pair.

Genes on individual amphioxus scaffolds have orthologues that are generally concentrated in specific regions of vertebrate chromosomes (Fig. 2). Furthermore, multiple amphioxus scaffolds typically exhibit hits to the same sets of human chromosomal regions. Within each region, only limited conservation of gene order is observed (Methods; Supplementary Note 8). This pattern of conserved synteny shows that genome rearrangements have not erased the imprint of the genome organization of the last common chordate ancestor from the present human and amphioxus genomes. By using this pattern, we identified 135 human chromosomal segments (listed in Supplementary Table 14) that retain relict signals of the ancestral chordate karyotype despite chromosomal rearrangements in each lineage (Methods). These segments span a mean of 170 genes.

We exploited the pattern of amphioxus–human synteny to identify 17 ancient chordate linkage groups (CLGs) by clustering both amphioxus scaffolds and human chromosomal segments according to their pattern of hits in the other genome (Methods). The resulting ‘dot plot’ (Supplementary Fig. 64) shows that orthologues are concentrated in 17 distinct blocks. Within each block, gene order is considerably scrambled. The natural interpretation of these blocks is that each represents an ancient chordate linkage group that evolved into a defined group of chromosomal segments in amphioxus, human, chicken and teleost fish.

We tested our interpretation of the chordate linkage groups as coherently evolving segments by using fluorescent in situ hybridization (FISH) to demonstrate that 15 out of 16 scaffolds from CLG 15 localized to a single amphioxus chromosome. (FISH results for the BACs corresponding to the sixteenth scaffold were ambiguous; see Supplementary Table 2.) Similarly, an independent study of amphioxus cosmids containing NK group homeobox genes in CLG 7 found that they localize to several distant regions of a single chromosome in amphioxus27,28. These data support the claim that the 17 putative ancestral chordate linkage groups have been maintained in modern amphioxus as coherent chromosomal segments. The 19 pairs of modern amphioxus chromosomes, however, imply at least two subsequent fissions in the amphioxus lineage (see below).

Within these segments, nearly 60% of the human genes that possess amphioxus orthologues participate in the conserved linkage groups. This represents a lower bound, because short amphioxus scaffolds are less likely to be assigned to CLGs. Conversely, 88% of amphioxus gene models on scaffolds assigned to a CLG have their human orthologue in a conserved position (that is, in the same CLG). Remarkably, to the resolution of our analysis, some entire chromosomes (for example, human 18 and 21; chicken 7, 12, 15, 19, 21, 24 and 27) and chromosome arms (including human 5p) seem to have maintained their integrity (with local scrambling and some gene gain and loss) since the last common chordate ancestor. The CLGs defined by comparing amphioxus and vertebrate genomes also provide a new perspective on tunicate genome evolution, because it appears that C. intestinalis chromosomes 10, 12 and 14 are each relicts of a single CLG (11, 5 and 8, respectively), and other conserved linkages are evident (Supplementary Fig. 14).

Quadruple conserved synteny

We can trace the evolution of chordate genomes through time using two additional types of evidence. First, we can constrain the timing of specific chromosome breaks by parsimony analysis of conserved synteny across human, chicken and teleost genomes. Second, we can use the presence (or absence) of paralogous gene pairs to identify segments derived from the same chordate proto-chromosome by duplication (or fission).

For example, five groups of human segments from chromosomes 1, 5, 9 and 19 cluster together in CLG 1 (Supplementary Note 8; Supplementary Fig. 64). The segment pair 1.5/7 from chromosome 1 is related to each of the others by a significant concentration of ancient gene paralogues (17 to 31 pairs, P < 1 × 10⁻¹⁰), indicating that it is related to the others by duplication. In contrast, only a single pair of ancient paralogues relates segments 5.1 and 9.1/3, and orthologues of the genes in these segments occur predominantly on the same chromosomes of both pufferfish and stickleback. Thus, 5.1 and 9.1/3 were probably created by breakage of a single ancestral segment of the genome of the bony vertebrate ancestor. If 5.1 and 9.1/3 are virtually merged, then all remaining pairings of human segments from CLG 1 show a significant excess of ancient paralogues, consistent with amplification to four through two successive duplications.

To obtain a genome-wide view of the history of chromosomal evolution on the vertebrate stem, we applied a similar analysis systematically to the 17 CLGs by exhaustively evaluating all partitionings of human genome segments, and using a parsimony criterion to identify the most likely reconstruction. The most parsimonious partitioning of human segments into paralogy groups is summarized in Supplementary Table 1 and diagrammed in Fig. 3. This analysis shows that most of the human genome (112 segments spanning 2.68 Gb, or 95% of the euchromatic genome) was affected by large-scale duplication events on the vertebrate stem before the bony vertebrate radiation (that is, the teleost/tetrapod split), and that nearly all of the ancient chordate chromosomes were quadruplicated (Supplementary Fig. 9).

Figure 3: Quadruple conserved synteny.

Partitioning of the human chromosomes into segments with defined patterns of conserved synteny to amphioxus (B. floridae) scaffolds. Numbers 1–17 at the top represent the 17 reconstructed ancestral chordate linkage groups, and letters a–d represent the four products resulting from two rounds of genome duplication. Coloured bars are segments of the human genome, shown grouped by ancestral linkage group (above), and in context of the human chromosomes (below).

This pattern of genome-wide quadruple conserved synteny15 definitively demonstrates the occurrence of two rounds of whole-genome duplication (2R) and provides a comprehensive reconstruction of the evolutionary origin of the human chromosomes (and those of other jawed vertebrates) through these duplications on the vertebrate stem. This characterization extends previous lines of evidence for whole-genome duplication events based on comparative studies of specific regions of interest across chordate genomes (for example, the Hox cluster25 and the major histocompatibility complex region28,46) and the analysis of vertebrate gene families (reviewed in ref. 29), as well as the identification of paralogous segments and chromosomal relationships within the human genome10,13,14,47. A manual, phylogeny-based analysis of the four scaffolds making up the NK homeobox-containing paralogon was in agreement with these results (Methods).

Timing of events on the vertebrate stem

The amphioxus–human synteny analysis presented here demonstrates that two rounds of whole-genome duplication occurred on the vertebrate stem after the divergence of cephalochordates but before the split of teleosts and tetrapods. The next question is whether these two genome-scale duplications happened in rapid succession or even effectively simultaneously, or were separated in time15. We sought to resolve the 2R events relative to the divergence of cartilaginous fish, urochordates and jawless vertebrates (for example, lamprey).

Sample sequencing from the elephant shark Callorhynchus milii, for example, demonstrates significant conserved macrosynteny between cartilaginous fish and humans, because pairs of genes that are 35–40 kb apart in the elephant shark genome are also linked on the human genome48. These links occur predominantly within the human segments defined above, indicating that the orthologous chromosome segments are also found in the elephant shark genome (Supplementary Note 9). Furthermore, previous analysis of phylogenetic topologies dated all duplications before the split between cartilaginous and bony vertebrates49. Therefore, 2R occurred before this split48. Similarly, the preservation of several CLGs as intact single chromosomes in C. intestinalis (Supplementary Fig. 14) implies that both rounds of duplication occurred after the divergence of the urochordate lineage.

Sequencing of the repeat-rich lamprey genome has not generated enough long scaffolds to permit large-scale analysis of synteny50. To infer the timing of 2R relative to the divergence of the lamprey lineage, we generated a set of 50,000 ESTs from the sea lamprey Petromyzon marinus (Supplementary Note 3) and analysed the phylogenetic topology of 358 gene families that include pairs of synteny-confirmed human paralogues produced during 2R. The results are summarized in Table 1, along with a parallel analysis using sea squirt and pufferfish for comparison. We find that 58% of the resolved four-gene phylogenies place the lamprey gene closer to one of the human paralogues; this is similar to the results of ref. 51 but analyses a tenfold larger set of gene families (Supplementary Fig. 2). This result is clearly distinct from that expected if the lamprey lineage diverged either much before (as for sea squirt) or after (as for pufferfish) 2R. The remaining scenarios are either that the jawed vertebrate and lamprey lineages diverged in the period between two well-separated whole-genome duplications, or that one, or both, of the 2R whole-genome events occurred nearly coincident with the lamprey lineage divergence. The time interval that distinguishes ‘nearly coincident’ from ‘well-separated’ is determined by the process of rediploidization, during which most gene duplicates are lost and the sequences of the surviving paralogues diverge9,15.

Table 1 Timing of whole-genome duplications

Karyotypic changes in the vertebrate and tunicate lineages

From the 17 ancestral chordate linkage groups, 2R nominally produced 17 × 4 = 68 proto-vertebrate segments, although this naive inference assumes that, first, all duplicated segments were retained and, second, no fusions, fissions or additional segmental duplications occurred during 2R. Some 2R-produced segments, however, are consistently linked in contemporary bony vertebrate genomes (for example, 12b and 1d, which co-occur on human chromosome 1, chicken chromosome 8 and stickleback linkage groups III and VIII), indicating a fusion before the bony vertebrate (osteichthyan) ancestor. We found evidence for at least 20 such fusions (Supplementary Note 8). Allowing for a range of nearly parsimonious reconstructions of 2R, we estimate that the bony vertebrate ancestor had between 37 and 49 chromosomes. Additional fusions on the teleost stem reduced this number to 12 (refs 52–56) before the teleost-specific genome duplication. On the tetrapod stem, the chicken and human genomes share 4 fusions of bony vertebrate segments, suggesting 33–45 chromosomes. These are consistent with recent estimates based on intra-vertebrate comparisons47,52,53,54,55,56.

Ancient developmental gene linkages

The amphioxus genome has also retained ancient local gene linkages (micro-synteny) in addition to conserved macro-synteny. In some cases, local linkages are even older than the chordates, and date back to the bilaterian ancestor or earlier. As an example, we considered gene families that expanded by tandem duplication early in animal evolution, specifically, the Antennapedia (ANTP) and Paired (PRD) classes of homeobox genes and the Wnt gene family. We examined how frequently these genes are still neighbours in the amphioxus genome, and discovered five new examples of ancient pairs or clusters of ANTP or PRD genes: (1) Otx and goosecoid, (2) Mnx and ro, (3) Nkx2-1 and Nkx2-2, (4) Nkx6, Nkx7, Lbx and Tlx, and (5) En, Nedxa, Nedxb and Dll (Supplementary Table 3). These gene pairs or clusters (along with the well-known Hox, ParaHox and NK linkages) originated by tandem duplication before the divergence of bilaterians, yet their tight linkage has not been disrupted by genome rearrangement. The Nkx2-1/Nkx2-2 gene pair has been retained in vertebrates (and is duplicated), but in every other example the tight linkage (clustering) has been lost in the human genome. None of the five newly described examples are retained in the Drosophila melanogaster genome. The situation in the Wnt gene family is a little different, because both amphioxus and Drosophila have retained tight linkages that have been disrupted in the human lineage owing to genome duplication followed by differential gene loss (Fig. 4). These results underscore the fact that the amphioxus genome has undergone less genomic rearrangement than the human and other vertebrate genomes since their shared ancestor more than half a billion years ago.

Figure 4: Ancient developmental gene linkages.

In the amphioxus genome, Wnt6, Wnt1 and Wnt9 form a compact gene cluster 2.5 Mb from Wnt10 and Wnt3, but all are on scaffold 149. Orthologues of four of these genes are also clustered in Drosophila (not shown), although Wnt3 has been lost, as inferred from its presence in cnidarians. The human orthologues are on four chromosomes, and show disruption of gene clustering through duplication followed by gene loss. Linkage of Hox clusters to three of the human loci gives additional support for the large-scale duplication events involved. The three clusters that are linked to Hox clusters (as well as the four Hox clusters) fall in chromosome segments grouped in CLG 16, as does the Hox-bearing amphioxus scaffold. Genes drawn as boxes above the horizontal lines are transcribed from left to right; genes depicted below the lines are transcribed from right to left.

Impact of whole-genome duplications

How many duplicate genes survive in modern vertebrate genomes from the two genome-wide events? Twenty-five per cent (2,131) of the ancestral chordate gene families (out of the 8,437 indicated above) have two or more ancient vertebrate paralogues (‘ohnologues’) that were evidently produced by ancient gene duplication(s) after the divergence of amphioxus (see also Supplementary Fig. 3). Of these, 1,489 (70%) are embedded within paralogous segments from our reconstruction of 2R, as portrayed in Fig. 3, and were plausibly created through 2R. These retention rates for 2R-duplicated genes are comparable to other estimates based on large-scale gene phylogenies10,14,57,58. Similar retention rates were found for the 350-million-year-old teleost-fish-specific duplication55,59,60,61 and were estimated for the 40-million-year-old genome duplication found in the frog Xenopus laevis62.
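
The percentages in this paragraph follow directly from the family counts. The short check below recomputes them; the variable names are illustrative, and the counts are those quoted in the text.

```python
ancestral_families = 8437       # ancestral chordate gene families
with_ohnologues = 2131          # families with two or more ancient vertebrate paralogues
within_2r_segments = 1489       # of those, families embedded in paralogous 2R segments

retained_share = 100 * with_ohnologues / ancestral_families
in_segments_share = 100 * within_2r_segments / with_ohnologues

print(f"{retained_share:.1f}%")     # 25.3%
print(f"{in_segments_share:.1f}%")  # 69.9%
```

Both values round to the 25% and 70% given in the text.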

Gene duplicates from 2R that have been retained in modern genomes are significantly enriched for functions associated with signal transduction, transcriptional regulation, neuronal activities and developmental processes (Supplementary Table 18, Methods). For example, genes implicated in signal transduction are more than twice as likely to be retained in two or more copies from 2R compared to the overall retention rate. These results are consistent with the hypothesis that paralogues created by whole-genome duplication were recruited for roles in the development of novel features of vertebrate biology, and with similar biased retention in teleost fishes61. Whole-genome duplications, however, may have allowed entire molecular pathways to be duplicated and sub-functionalized coincidently (reviewed in ref. 63). Whereas similar numbers of gene duplicates are found in amphioxus relative to the chordate ancestor, different gene classes have been expanded, and the mechanism of gene duplication is different (Supplementary Note 7).

Conserved non-coding sequences

Inspired by the extensive conserved synteny between amphioxus and vertebrates, we searched for conserved non-coding sequences that might reflect ancient chordate cis-regulatory elements. Genome-scale comparisons between mammals and teleost fish have revealed up to 3,100 conserved non-coding sequences, most of which function as tissue-specific enhancers64,65. At greater phylogenetic distances, no conservation outside of coding sequences and conserved microRNAs66 has been identified so far. By aligning the amphioxus and human genomes (Supplementary Note 10), we identified 77 putative chordate conserved non-coding elements (>60% identity over >50 bp), after excluding transcribed or repetitive sequences and requiring conservation in at least one other vertebrate. Of these, 16 overlap with or are immediately adjacent to the 3′ or 5′ untranslated regions (UTRs) of human genes, and are probably conserved UTR elements. Four are adjacent to exons and probably represent conserved splicing enhancers. A single conserved non-coding element overlaps a highly conserved microRNA gene (mir-10b adjacent to the human HOXD4 gene). The remaining 56 elements are of unknown function, but can be tested experimentally for enhancer activity45.
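The filtering criteria stated above can be sketched in a few lines of Python. This is only an illustration of the thresholds given in the text, not the pipeline of Supplementary Note 10; the class name, field names and example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedBlock:
    """One amphioxus-human alignment block (all field names are hypothetical)."""
    length_bp: int          # aligned length in base pairs
    identity: float         # fraction of identical aligned bases, 0..1
    transcribed: bool       # overlaps known transcribed sequence
    repetitive: bool        # overlaps repetitive sequence
    other_vertebrates: int  # other vertebrates in which the block is conserved


def is_candidate_cne(block, min_identity=0.60, min_length_bp=50):
    """Apply the stated criteria: >60% identity over >50 bp, not transcribed,
    not repetitive, and conserved in at least one other vertebrate."""
    return (
        block.identity > min_identity
        and block.length_bp > min_length_bp
        and not block.transcribed
        and not block.repetitive
        and block.other_vertebrates >= 1
    )


# Tally reported in the text: 77 elements, of which 16 are UTR-associated,
# 4 are exon-adjacent and 1 overlaps a microRNA gene.
unclassified = 77 - (16 + 4 + 1)
```

The final line simply checks the bookkeeping in the paragraph: the three annotated classes leave 56 elements of unknown function.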

Conclusions

The amphioxus sequence reveals key features of the genome of the last common ancestor of all chordates through comparison with the genomes of other animals. This ancestor probably lived before the Cambrian period, and gave rise to the chordate lineage that is represented today by modern cephalochordates such as amphioxus, as well as urochordates and vertebrates. Of the living lineages, the cephalochordates diverged first, before the split between the morphologically diverse urochordates and vertebrates. To a remarkable extent, the amphioxus genome appears to be a good surrogate for the ancestral chordate genome with respect to gene content, exon–intron gene structure and even chromosomal organization. The sequences of model ascidians with small genomes are by comparison simplified by gene loss, intron loss and genome rearrangement. Remarkably, modest levels of non-coding sequence have been conserved between amphioxus and human—the oldest conserved non-coding regions yet detected through direct sequence alignment—and may provide a tantalizing glimpse of the gene regulatory systems of the last common chordate ancestor.

Extensive conserved synteny between the genomes of amphioxus and various vertebrates lends unprecedented clarity and coherence to the history of genome-scale events in vertebrate evolution. The human and other jawed vertebrate genomes show widespread quadruple-conserved synteny relative to the amphioxus sequence, which extends earlier regional studies and provides a unified explanation for paralogous chromosomal regions in vertebrates. Our analysis thus provides conclusive evidence for two rounds of complete genome duplication on the jawed vertebrate stem. These genome duplications occurred after the divergence of urochordates but before the split between cartilaginous fish and bony vertebrates. The jawless vertebrates (for example, lamprey and hagfish) represent the only other chordate lineages that survive from this period, and at least the lamprey appears to have diverged between the two rounds of duplication, although the data still allow for an octoploid population as the progenitor of the jawless and jawed vertebrates. The detailed mechanisms of these events (in particular, whether they occurred by allo- and/or auto-tetraploidizations, how closely spaced in time they were, and the precise nature of the rediploidization process) remain unknown. Although it is tempting to relate the genome duplications to specific morphological radiations in vertebrate evolution, the fossil record reflects a relatively steady diversification rather than a dramatic discontinuity of stem-group vertebrate forms67.

The genomic features that are associated with organismal complexity, if such generic features exist at all, remain unknown68. It is tempting to speculate, however, that the creation of the ancestral jawed vertebrate genome by two rounds of genome duplication was a formative event in the early history of vertebrates that provided genomic flexibility through the duplication of coding and cis-regulatory sequences for the emergence of familiar developmental, morphological and physiological novelties such as chondrogenic and skeletogenic neural crest cells, the sclerotome (vertebral) compartment of the somites, elaborate hindbrain patterning, finely graded nervous system organization, and the elaborated endocrine system of vertebrates. Indeed, we find that genes involved in developmental signalling and gene regulation are significantly more likely to be retained in multiple copies in living species than genes overall, suggesting that diversified developmental regulation is correlated with the evolution of vertebrate novelties. This raises the question, dating back to Ohno, of how such duplicated genes became integrated into the biochemical and genetic networks of vertebrates. In a separate paper45, we examine vertebrate biology in the light of the amphioxus genome data and the genome-scale duplication events on the vertebrate stem.

Methods Summary

Genome sequencing, assembly and annotation

High-quality sequence Sanger reads (7.3 million) were generated and assembled using JAZZ69. Protein-coding genes were annotated using EST, homology and ab initio methods as described previously42,70.

Deuterostome relationships

Orthologous gene alignments were created using ClustalW71 and Gblocks72, and analysed with bayesian and maximum likelihood methods.

Intron evolution

The presence or absence of an intron at each of 5,337 orthologous coding positions was treated as a binary character in parsimony and bayesian analyses.

Construction of chordate linkage groups

Human chromosome segments and amphioxus scaffolds were clustered by orthologue distribution profile. The null hypothesis (orthologous genes randomly distributed across the two genomes) was evaluated using Fisher’s exact test, with a Bonferroni correction for the total number of pairwise tests.

Decomposition of CLGs into independent products of duplication

The most parsimonious partitioning of the human chromosomal segments assigned to each CLG was obtained using a scoring system that included shared ohnologues and position in the multi-species synteny comparison.

NK quadruple conserved synteny

In addition to the genome-wide synteny analysis, a detailed manual curation was carried out on four v1 scaffolds (56, 124, 185 and 294) that make up the NK homeobox cluster in amphioxus. The 82 amphioxus genes correspond to 111 human genes enriched on chromosomes 4, 5, 8 and 10 (chi-squared test, P < 0.001), in agreement with the genome-wide analysis of CLG 7 (Supplementary Note 11).

Ancient developmental gene linkages

Orthology of homeodomain- and Wnt-containing genes was assigned from phylogenetic tree reconstruction using neighbour-joining and maximum likelihood approaches, supported by high bootstrap values.

Online Methods

Genome sequence and assembly

High-quality sequence Sanger reads (7.3 million) were generated and assembled using JAZZ69,73. See Supplementary Note 2 for details of the genomic libraries, assembly methods and validation.

Annotation of protein-coding genes

Protein-coding genes were annotated by the JGI annotation pipeline as previously described42,70. See Supplementary Note 3 for a description of amphioxus-specific details. Gene models representing allelic pairs were identified using a combination of similarity of predicted peptide sequence and gene neighbourhood context (Supplementary Note 3).

Deuterostome relationships

Sets of orthologous genes were collected by grouping together mutual-best BLAST74 hits between N. vectensis (sea anemone) and gene sets from other published deuterostome genomes plus expressed sequence data from the acorn worm Saccoglossus kowalevskii75 and the sea lamprey P. marinus (Supplementary Note 5). Individual multiple alignments were created with ClustalW71, manually reviewed, trimmed with Gblocks72, and concatenated. A total of 364 orthologue sets had representation from all genomes (alignment 1), and 1,090 had at most one genome missing (alignment 2). The alignments were analysed by bayesian and maximum likelihood methods, using MrBayes76,77 and PHYML78, respectively. Supplementary Note 5 contains more details of the data sources, data compilation and analysis.
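The mutual-best-hit criterion can be illustrated with a short Python sketch. It assumes that BLAST scores have already been parsed into dictionaries keyed by (query, subject); the function names, gene identifiers and scores are hypothetical, and the full grouping procedure is the one described in Supplementary Note 5.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores: dict mapping (query, subject) -> BLAST score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def mutual_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs in which each gene is the other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Toy example with made-up identifiers and scores.
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 150, ("a2", "b2"): 90}
b_vs_a = {("b1", "a1"): 210, ("b2", "a1"): 160, ("b2", "a2"): 80}
pairs = mutual_best_hits(a_vs_b, b_vs_a)
```

In this toy case a2 points to b2, but b2 prefers a1, so only the a1/b1 pair is reciprocal.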

Intron evolution

A collection of 5,337 orthologous well-aligned coding sequence positions that contain an intron in at least one genome was analysed by weighted parsimony analysis with PAUP79 and Bayesian analysis with MrBayes76,77 (Supplementary Note 6).
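As an illustration of how intron presence or absence can be encoded as binary characters, the following Python sketch builds a character matrix from per-genome sets of intron positions. The input layout, the position labels and the function name are assumptions made for this example; the actual weighting and analysis are described in Supplementary Note 6.

```python
def intron_character_matrix(introns_by_genome, aligned_positions):
    """Encode intron presence (1) or absence (0) per genome.

    Only aligned positions that carry an intron in at least one genome are
    kept as characters, mirroring the selection described in the text.
    """
    informative = sorted(
        pos for pos in aligned_positions
        if any(pos in introns for introns in introns_by_genome.values())
    )
    matrix = {
        genome: "".join("1" if pos in introns else "0" for pos in informative)
        for genome, introns in introns_by_genome.items()
    }
    return informative, matrix


# Toy data: positions are (gene family, alignment column) pairs, all made up.
introns_by_genome = {
    "amphioxus": {("fam1", 120), ("fam1", 305), ("fam2", 48)},
    "human": {("fam1", 120), ("fam2", 48)},
    "sea_squirt": {("fam2", 48)},
}
aligned_positions = {("fam1", 120), ("fam1", 305), ("fam2", 48), ("fam2", 410)}
positions, matrix = intron_character_matrix(introns_by_genome, aligned_positions)
```

The resulting strings can be written out as rows of a character matrix for a parsimony or bayesian program.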

Chordate gene families

Using predicted proteomes for human, chicken, stickleback, pufferfish, sea squirt, amphioxus, sea urchin, fruitfly and sea anemone (Supplementary Note 5), families (‘clusters’) of orthologous genes were constructed to represent the ancestral gene complements of the tetrapod, teleost, jawed vertebrate, ‘olfactores’ (that is, vertebrates plus urochordates), chordate and deuterostome ancestors, as described previously42, with modifications described in Supplementary Note 7.

Chromosome segmentation

The human, chicken and stickleback chromosomes were segmented iteratively by comparison to one another and to the scaffolds of the Fugu genome assembly. See Supplementary Note 8 for complete details.

Construction of chordate linkage groups

For whole-genome synteny analysis, orthology between genomes was based on c-score (BLAST score/best BLAST score) clustering as previously described42, with a c-score threshold of 0.75 when comparing human and amphioxus, and 0.95 when comparing human to other vertebrates. To define initial CLGs (Supplementary Fig. 64), human chromosome segments and amphioxus scaffolds were clustered using the same method as for chromosome segmentation, with a correlation threshold of 0.25. Statistical significance of orthologue concentration between regions of one genome and another was computed with Fisher’s exact test with a Bonferroni correction for the total number of pairwise tests (see Supplementary Note 8 for additional details).
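The significance test can be sketched as follows. This is a minimal standard-library illustration, assuming a one-sided (enrichment) form of Fisher's exact test and a significance level of 0.05, neither of which is specified in the text; the function names and the toy counts are hypothetical.

```python
from math import comb


def fisher_exact_greater(shared, on_segment, on_scaffold, total):
    """One-sided Fisher's exact test, P(X >= shared).

    shared:      orthologue pairs linking the human segment and the scaffold
    on_segment:  orthologue pairs with the human gene on the segment
    on_scaffold: orthologue pairs with the amphioxus gene on the scaffold
    total:       all orthologue pairs in the comparison
    """
    upper = min(on_segment, on_scaffold)
    tail = sum(
        comb(on_segment, i) * comb(total - on_segment, on_scaffold - i)
        for i in range(shared, upper + 1)
    )
    return tail / comb(total, on_scaffold)


def bonferroni_significant(p_values, alpha=0.05):
    """Keep tests whose P value, multiplied by the number of tests, is below alpha."""
    n_tests = len(p_values)
    return {key: p for key, p in p_values.items() if p * n_tests < alpha}


# Toy counts (made up): one strongly enriched pair and one background pair.
p_values = {
    ("hs_seg_1", "bf_scaf_1"): fisher_exact_greater(12, 40, 30, 1000),
    ("hs_seg_1", "bf_scaf_2"): fisher_exact_greater(1, 40, 25, 1000),
}
significant = bonferroni_significant(p_values)
```

Because the correction multiplies each P value by the total number of segment-scaffold pairs tested, only strongly concentrated orthologue sets survive in a genome-wide comparison.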

Fluorescent in situ hybridization

Chromosome preparation was performed as described previously80 with modifications described in Supplementary Note 8.

Multi-species synteny comparison

The clustering protocol described above for human–amphioxus was repeated for human–Fugu, human–stickleback and human–Nematostella to define clusters of scaffolds or chromosome segments (a ‘cluster set’) for each genome based on comparison to human. All pairs of human chromosome segments were compared to each cluster set, and for each set were classified as having conserved synteny to the same cluster (coded with ‘1’), having conserved synteny (only) to different clusters (coded with ‘0’), or having indeterminate conserved synteny if one or both human segments lack significant conserved synteny to any cluster in the cluster set (coded with ‘?’). The complete results are represented as a colour-coded matrix in Supplementary Fig. 9.
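The three-state coding can be expressed compactly. The sketch below assumes that each human segment is summarized by the set of clusters to which it shows significant conserved synteny (an empty set meaning none), and that sharing any cluster counts as synteny to the same cluster; this representation and all names are assumptions made for illustration.

```python
from itertools import combinations


def code_pair(clusters_a, clusters_b):
    """Return '1', '0' or '?' for a pair of human chromosome segments."""
    if not clusters_a or not clusters_b:
        return "?"  # at least one segment lacks significant conserved synteny
    if clusters_a & clusters_b:
        return "1"  # conserved synteny to the same cluster
    return "0"      # conserved synteny only to different clusters


def synteny_codes(segment_clusters):
    """Code every unordered pair of segments against one cluster set."""
    return {
        (a, b): code_pair(segment_clusters[a], segment_clusters[b])
        for a, b in combinations(sorted(segment_clusters), 2)
    }


# Toy cluster assignments for one comparison genome (made-up names).
segment_clusters = {
    "seg1": {"c1"},
    "seg2": {"c1"},
    "seg3": {"c2"},
    "seg4": set(),
}
codes = synteny_codes(segment_clusters)
```

Repeating this for each comparison genome gives one code per segment pair per genome, which is the information summarized in the colour-coded matrix.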

Identification of ohnologues

Operationally, we define a pair of human genes as ohnologues (paralogues descended from 2R in vertebrate evolution) if they are found in the same chordate gene family (excluding large gene families with more than ten members) and are ancient paralogues differing by more than 0.2 transversions per site at synonymous positions. (We use transversions rather than the more common total substitutions because transversions occur more slowly and therefore show less saturation at the timescales of interest.)
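The operational definition can be sketched in Python. The example assumes that the synonymous positions of the two genes have already been extracted as aligned nucleotide strings, and it uses the raw proportion of transversions without any correction for multiple hits; both are simplifications for illustration, and the function names are hypothetical.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def is_transversion(x, y):
    """True if x and y are a purine/pyrimidine mismatch."""
    return (x in PURINES and y in PYRIMIDINES) or (x in PYRIMIDINES and y in PURINES)


def transversions_per_site(sites_a, sites_b):
    """Raw proportion of transversions over comparable (A, C, G, T) sites."""
    if len(sites_a) != len(sites_b):
        raise ValueError("aligned site strings must have equal length")
    compared = [
        (x, y)
        for x, y in zip(sites_a.upper(), sites_b.upper())
        if x in "ACGT" and y in "ACGT"
    ]
    if not compared:
        raise ValueError("no comparable sites")
    return sum(is_transversion(x, y) for x, y in compared) / len(compared)


def is_ohnologue_candidate(family_size, tv_per_site, max_family=10, min_tv=0.2):
    """Same-family pair, family of at most ten members, more than 0.2 transversions per site."""
    return family_size <= max_family and tv_per_site > min_tv


# Toy example: three transversions (A/C, A/C, A/T) and one transition (A/G)
# across ten synonymous sites.
tv = transversions_per_site("AAAAAAAAAA", "CAGACATAAA")
```

Transitions such as A/G are ignored by construction, which is why the toy pair scores three differences out of ten sites rather than four.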

Decomposition of CLGs into independent products of duplication

For each CLG, all partitionings of the human chromosomal segments assigned to the CLG were tabulated. Each partitioning was assigned a score as follows: +1 for each pair of segments from different partitions with a significant number of predicted ohnologues between them; -1 for each pair of segments from different partitions without a significant number of ohnologues between them. Among the partitionings with the maximum score, ties were broken by using the multi-species synteny comparison results: a score of +ε for each pair of segments from different partitions coloured red or orange, and a score of -ε for each pair of segments from different partitions where multi-synteny comparison indicates the two segments were one segment in the jawed vertebrate ancestor (coloured blue or purple in Supplementary Figs 9 and 10), where ε is a positive number much less than 1.
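The scoring scheme can be illustrated with a small exhaustive search in Python. This is a sketch under stated assumptions rather than the implementation used for the analysis: the tie-break is expressed as the second element of a tuple, which orders partitionings the same way as adding +ε or -ε for any sufficiently small ε, and all function names, argument names and segment labels are hypothetical.

```python
from itertools import combinations


def set_partitions(items):
    """Yield every partition of items as a list of blocks (lists)."""
    items = list(items)
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Add the first item to each existing block in turn ...
        for i, block in enumerate(partition):
            yield partition[:i] + [[first] + block] + partition[i + 1:]
        # ... or give it a block of its own.
        yield [[first]] + partition


def score(partition, ohnologue_pairs, duplicated_pairs=frozenset(),
          single_segment_pairs=frozenset()):
    """Return (main score, tie-break) for one partitioning.

    ohnologue_pairs:      pairs with a significant number of ohnologues
    duplicated_pairs:     pairs coloured red or orange in the synteny matrix
    single_segment_pairs: pairs inferred to have been one ancestral segment
    """
    block_of = {seg: i for i, block in enumerate(partition) for seg in block}
    main = tie_break = 0
    for a, b in combinations(sorted(block_of), 2):
        if block_of[a] == block_of[b]:
            continue  # only pairs from different partitions are scored
        key = frozenset((a, b))
        main += 1 if key in ohnologue_pairs else -1
        tie_break += (key in duplicated_pairs) - (key in single_segment_pairs)
    return main, tie_break


def best_partition(segments, ohnologue_pairs, duplicated_pairs=frozenset(),
                   single_segment_pairs=frozenset()):
    """Return a highest-scoring partitioning by exhaustive enumeration."""
    return max(
        set_partitions(segments),
        key=lambda p: score(p, ohnologue_pairs, duplicated_pairs, single_segment_pairs),
    )


def pair(a, b):
    return frozenset((a, b))


# Toy CLG: A1 and A2 share no ohnologues, whereas every other pair of
# segments does.
segments = ["A1", "A2", "B", "C"]
ohnologue_pairs = {pair("A1", "B"), pair("A2", "B"), pair("A1", "C"),
                   pair("A2", "C"), pair("B", "C")}
best = best_partition(segments, ohnologue_pairs)
```

In the toy case the search groups A1 with A2 and keeps B and C separate, as expected for two fragments of a single duplication product. The number of partitionings grows as the Bell numbers, so exhaustive enumeration of this kind is practical only for a modest number of segments per CLG.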

Timing of genome duplications

Genes from pufferfish, lamprey and sea squirt (‘X’) were aligned to pairs of human ohnologues (‘Hs1’ and ‘Hs2’) and the orthologous amphioxus gene; phylogenetic position was considered resolved if it had at least 50% maximum likelihood bootstrap support (Supplementary Note 9 and Supplementary Fig. 11).

Ancient developmental gene linkages

Genes were identified by tBLASTn against version 1.0 of the B. floridae genome assembly with vertebrate and invertebrate homeodomain and Wnt sequences. Orthology was assigned from phylogenetic tree reconstruction using neighbour-joining and maximum likelihood approaches. Support for nodes was assessed by bootstrapping; all gene families were recovered with high support. Human data from Ensembl release 47 were used. Supplementary Table 3 lists the genes and gene models examined.

Functional categorization of retained duplicate genes

PANTHER functional annotations were mapped to inferred ancestral chordate genes, and subsets of these genes were analysed for enrichment in functional categories by methods previously described for the analysis of ancestral eumetazoan genes42. Because functional annotations overlap, the category of ‘developmental processes’ is itself dominated by genes associated with signal transduction and transcriptional regulation.

Conserved non-coding element and expression analysis

For conserved non-coding element and expression analysis, see Supplementary Note 10.