Main

Schistosomiasis is a neglected tropical disease that ranks with malaria and tuberculosis as a major source of morbidity affecting approximately 210 million people in 76 countries, despite strenuous control efforts1. It is caused by blood flukes of the genus Schistosoma (phylum Platyhelminthes), which exhibit dioecy and have complex life cycles comprising several morphologically distinct phenotypes in definitive human and intermediate snail hosts. Schistosoma mansoni, one of the three major human species, occurs across much of sub-Saharan Africa, parts of the Middle East, Brazil, Venezuela and some West Indian islands. The mature flukes dwell in the human portal vasculature, depositing eggs in the intestinal wall that either pass to the gut lumen and are voided in the faeces, or travel to the liver where they trigger immune-mediated granuloma formation and peri-portal fibrosis2. Approximately 280,000 deaths per annum are attributable to schistosomiasis in sub-Saharan Africa alone3. However, the disease is better known for its chronicity and debilitating morbidity4. A single drug, praziquantel, is almost exclusively used to treat the infection but this does not prevent reinfection, and with the large-scale control programmes in place, there is concern about the development of drug resistance. Indeed, resistance can be selected for in the laboratory and there are reports of increased drug tolerance in the field5.

In this study we present the sequence and analysis of the S. mansoni genome. Previous metazoan projects have been restricted to Deuterostomia (for example, Homo, Mus and Ciona) and the ecdysozoan clade of the Protostomia (for example, Drosophila, Caenorhabditis and Brugia). Together with the accompanying article on S. japonicum6, we present, to our knowledge, the first descriptions of metazoan genomes from the lophotrochozoan clade. The genome reveals features that aid our understanding of the evolution of complex body plans. We have mined the genome to predict new drug targets, on the basis of searches involving traditional areas for drug discovery, metabolic reconstruction, and bioinformatics screens that exploit shared pharmacology. It is hoped that these and other targets will accelerate drug discovery, generating the much needed new treatments for the control and eradication of schistosomiasis.

Genome structure and content

The nuclear genome sequence of S. mansoni was determined by whole-genome shotgun sequencing and assembled into 5,745 scaffolds greater than 2 kilobases (kb) (Supplementary Table 1), totalling 363 megabases (Mb). Although 40% of the genome is repetitive, 50% is assembled into scaffolds of at least 824.5 kb. Furthermore, 43% of the genome assembly (distributed over 153 scaffolds) was unambiguously assigned to chromosomes (seven autosomal, plus ZW sex-determination pairs) using fluorescence in situ hybridization (FISH; Fig. 1, Supplementary Fig. 1 and Supplementary Table 2).
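The statement that 50% of the assembly lies in scaffolds of at least 824.5 kb corresponds to what is commonly called the scaffold N50. As a minimal sketch of how such a figure can be computed, assuming only a list of scaffold lengths (the function name and the toy lengths below are illustrative and are not taken from the assembly):

```python
def n50(lengths):
    """Return the N50: the largest length L such that sequences of
    length >= L together contain at least half of all assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no scaffold lengths supplied")


# Toy scaffold lengths in bp (hypothetical, not the S. mansoni assembly).
toy_scaffolds = [900_000, 700_000, 400_000, 200_000, 100_000]
print(n50(toy_scaffolds))  # 700000
```

Sorting from longest to shortest and stopping at the first scaffold that pushes the running total past half of the assembly gives the N50 directly.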

Figure 1: Physical map of S. mansoni.

a, b, Idiogram of S. mansoni chromosomes W, Z (a) and 3 (b). S. mansoni BAC clones were mapped to the karyotype of S. mansoni by FISH. The solid black areas are heterochromatin, the open areas are euchromatin. The BAC clones are identified by BAC numbers. c–f, Chromosome spreads with FISH-mapped BACs are shown. FISH-mapped BACs are identified by arrowheads on labelled chromosomes. Scale bar, 10 μm. See Supplementary Fig. 1 for idiograms of all S. mansoni chromosomes.

We identified 72 families of both long-terminal repeat (LTR) and non-LTR transposons, comprising 15% and 5% of the genome, respectively, and containing 63 and 60 new families each (Supplementary Table 3). The LTR transposons are from the Ty3/Gypsy and BEL clades, whereas the non-LTR transposons are restricted to the RTE, CR1 and R2 clades. Two previously described non-LTR retrotransposon families from the RTE clade (SR2 and Perere-3)7,8 seem to have undergone a burst of transposition events after divergence of S. mansoni and S. japonicum, and contribute to an overall higher representation of non-LTR retrotransposons in S. mansoni (15%, around 8% in S. japonicum). A new DNA transposon belonging to the Mu family was also found, which represents the first instance in a flatworm. The presence of target site duplications in some copies indicates recent transposition, and suggests that active copies may still exist in the genome. The lack of terminal inverted repeats, which are normally a feature of Mu family members, suggests an unusual mechanism for recognition of this element by the transposition apparatus.

We identified 11,809 putative genes encoding 13,197 transcripts. Considering genes that do not span a gap, the average gene size is 4.7 kb, typically with large introns (the average is 1,692 base pairs (bp)) and much smaller exons (the average is 217 bp). Moreover, the introns show a markedly skewed size distribution that has not been observed in other eukaryotes, whereby 5′ introns are smaller than 3′ introns (Fig. 2, Supplementary Information and Supplementary Table 5). In multi-exon genes, the first few introns can be as small as 26 bp, whereas introns towards the 3′ end are typically kilobases in length (the largest is 33.8 kb). The reason for this is unclear but it suggests unusual transcriptional control. However, a survey of conserved transcription factor domains shows S. mansoni to be broadly similar to other eukaryotes (Supplementary Information, Supplementary Fig. 2 and Supplementary Table 6). It is noteworthy that 43% of transcription factor families with schistosome representatives also contained vertebrate sequences, nearly twice the number that matched nematode worms, emphasizing their evolutionary distance.
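The positional skew in intron size can be made concrete by averaging intron lengths according to their rank from either end of a transcript. The following is a minimal sketch under stated assumptions: each transcript is represented simply as a list of intron lengths ordered 5′ to 3′, and the function name and toy values are hypothetical rather than drawn from the annotation.

```python
from collections import defaultdict


def mean_intron_length_by_position(transcripts, from_three_prime=False):
    """transcripts: list of intron-length lists, each ordered 5' -> 3'.

    Returns {position: mean length}, where position 1 is the first intron
    counted from the chosen end of the transcript.
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for introns in transcripts:
        ordered = reversed(introns) if from_three_prime else introns
        for position, length in enumerate(ordered, start=1):
            totals[position] += length
            counts[position] += 1
    return {pos: totals[pos] / counts[pos] for pos in sorted(totals)}


# Toy intron lengths in bp (invented for illustration).
toy = [[30, 40, 2000, 5000], [26, 1500, 3000], [35, 4000]]
from_5p = mean_intron_length_by_position(toy)
from_3p = mean_intron_length_by_position(toy, from_three_prime=True)
print(from_5p[1], from_3p[1])
```

Because transcripts differ in intron number, the means at higher positions are drawn from progressively fewer transcripts, so in practice a standard error would be reported alongside each mean.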

Figure 2: Intron size distribution.

The length of introns varies according to their position in a transcript, counting from the 5′ end (solid circles) and the 3′ end (open circles). Mean lengths ± standard errors are shown. After about five introns, the length difference is no longer apparent owing to the variation in the number of introns per transcript (see Supplementary Information).

Micro-exon genes

At least 45 genes have an unusual micro-exon structure. Individual micro-exons have been described in other genomes, dispersed among several normal exons9. However, S. mansoni is notable in containing micro-exon genes (MEGs), in which micro-exons comprise 75% of the coding sequence, are flanked at the 5′ and 3′ extremes by conventional exons, and have lengths that are multiples of three bases (from 6 to 36).
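To illustrate the structural criteria, the sketch below measures how much of a gene's exonic sequence is contributed by internal micro-exons and which of them have lengths divisible by three. It is a simplified illustration, not the prediction procedure used in this study: the 36-bp cut-off is taken from the size range quoted above, total exon length is used as a stand-in for coding sequence, and the function names and toy gene are hypothetical.

```python
MAX_MICRO_EXON_BP = 36  # upper bound of the 6-36 bp range quoted in the text


def micro_exon_fraction(exon_lengths):
    """Fraction of summed exon length contributed by internal exons of at
    most MAX_MICRO_EXON_BP; the first and last exons are treated as the
    conventional flanking exons and are never counted as micro-exons."""
    if len(exon_lengths) < 3:
        return 0.0
    internal = exon_lengths[1:-1]
    micro = sum(n for n in internal if n <= MAX_MICRO_EXON_BP)
    return micro / sum(exon_lengths)


def frame_preserving_micro_exons(exon_lengths):
    """Internal micro-exons whose length is a multiple of three, so that
    skipping any one of them leaves the reading frame intact."""
    return [n for n in exon_lengths[1:-1]
            if n <= MAX_MICRO_EXON_BP and n % 3 == 0]


# Toy gene: exon lengths in bp, ordered 5' -> 3' (invented values).
toy_gene = [60, 12, 27, 20, 36, 40]
print(micro_exon_fraction(toy_gene))
print(frame_preserving_micro_exons(toy_gene))
```

Exons whose lengths are multiples of three can be included or skipped without shifting the reading frame.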

Other than having shared gene structure, no similarity could be detected between 14 MEG families (each with up to 23 members; Fig. 3 and Supplementary Table 7). Moreover, they showed no similarity with annotated genes from outside Schistosoma spp., nor any identifiable motifs or functional domains. Comparisons between MEG family members and related proteins from S. japonicum suggest that some gene duplication events preceded the divergence of the two species. Almost all encode a signal peptide at the 5′ end and three have membrane anchors, so most are probably secreted. Examination of the large expressed-sequence tag (EST) data set from across the life cycle shows that genes from all MEG families are transcribed in the intramammalian stages of the life cycle, and the germ balls of daughter sporocysts that develop into infective cercariae, but probably not in miracidia that infect the snail intermediate host (Fig. 3).

Figure 3: Schematic representation of gene structure from MEG family members.

a, Structure of a representative member from each MEG family. Where several members were found, the total number detected is indicated in parentheses. Each box represents an exon drawn to scale, and the number above it indicates the exon size in nucleotides. For illustrative purposes, the introns are shown with fixed length. Black triangles and diamonds indicate exons encoding predicted signal peptides and transmembrane helices, respectively. Other characteristics associated with exons are indicated by colour and grouped as follows: micro-exons having lengths that either are multiples of 3 bp (red) or are indivisible by 3 bp (orange); exons longer than 36 bp and having lengths that either are multiples of 3 bp (blue) or are indivisible by 3 bp (green); putative initiation and termination exons (grey); untranslated region (UTR) (black). The asterisk indicates an exon deduced from transcript data, which did not match the sequenced genome. MEG-12 and MEG-13 structures were only partially predicted owing to the lack of transcripts containing the 5′ end of these genes. b, PCR with reverse transcription (RT–PCR) or EST-based evidence of transcription (black box) for each family across different life-cycle stages. C, cercaria; E, egg; G, germball; M, miracidium; 3s and 7s, 3- and 7-day schistosomula; 21li and 28li, 21- and 28-day liver worms; 45a, 45-day adult worm pairs.

Sequencing of transcripts from three MEG families revealed the occurrence of several alternative splice variants formed by exon skipping. In one of the families analysed, all internal exons except those coding for the signal peptide were missing in at least one transcript sampled, and a gene from a second family presented different transcripts with extended exons produced by the use of alternative splicing sites. These observations suggest that a ‘pick and mix’ strategy is used to create protein variation.

Evolution of triploblasty, parasitism and tissues

Schistosomes are the first Platyhelminthes to be fully sequenced, and provide insights into the evolution of ‘simple’ animals. Using Treefam to make comparisons with the sea anemone Nematostella vectensis, a representative of the Radiata, we sought gene families restricted to, or expanded in, the Bilateria (Supplementary Table 8). The advent of a third germ layer in flatworms is paralleled by the expansion of genes encoding cell adhesion molecules such as cadherins. Similarly, tissue-patterning developmental cues (for example, Notch/Delta) and histone-modifying enzymes (for example, histone acetyltransferases) have proliferated. Some genes, such as the tetraspanins that encode membrane structural proteins, have greatly proliferated in schistosomes, suggesting a critical role in worm physiology/parasitism. The large array of paralogues for fucosyl and xylosyltransferases involved in the generation of new glycans expressed at the host–parasite interface may be important for subverting the immune system. The expansion of proteases in schistosomes also seems to be directly related to parasitism, because it includes families involved in host invasion (invadolysins) and blood feeding (cathepsins). Furthermore, G-protein-coupled receptors (GPCRs) show varying levels of contraction in schistosomes, whereas several classes (for example, peropsins) are greatly expanded in Nematostella, indicating functions associated with the free-living lifestyle.

Although schistosomes are acoelomate, they possess tissues approaching the sophistication of organs—such as gut, nephridia, nerve and muscle—that are concerned with discrete physiological processes, such as feeding, excretion and locomotion. However, as lophotrochozoans they are evolutionarily distant from the previously sequenced parasitic nematodes Brugia10 and Meloidogyne11,12 (both ecdysozoans). Compartmentalisation of schistosome tissues and the formation of epithelial barriers are crucial for life in the hostile environment of the host bloodstream. Schistosomes possess the typical machinery of higher metazoa to interact with the cytoskeleton and control cell polarity (Supplementary Information and Supplementary Table 9), organize epithelia and denote tissue boundary lines.

S. mansoni possesses a nervous system that includes an anterior brain and longitudinal nerve cords, which extend from the brain to run the length of the worm body. Furthermore, a variety of sensory structures (at least six types in the cercaria13) are able to transduce a wide range of stimuli that assist in host location, penetration and navigation through the vasculature. In common with more complex organisms, schistosomes possess the tools needed to mediate neurogenesis and control axon growth cones and migration of neural cells (Supplementary Information and Supplementary Table 9), supporting the ancient origins of neural complexity.

Insights into possible new drug targets

Historically, anti-schistosomiasis agents were identified by in vivo screening in animal models. The S. mansoni genome project makes a more target-based approach to drug discovery feasible, and some promising leads have already emerged. These include a family of nuclear receptors14 (Supplementary Information) and a redox enzyme, thioredoxin glutathione reductase, recently validated as a drug target15. The condensed redox biochemistry of S. mansoni, relative to its human host, may offer further drug development targets (Supplementary Information). In the context of drug discovery, we have explored other potential areas of vulnerability, including: lipid metabolism, GPCRs, ligand- and voltage-gated ion channels, kinases, proteases and neuropeptides. We also undertook two bioinformatics-led approaches: metabolic reconstruction to identify chokepoints, and sequence searches for structures related to known drug targets.

Lipid metabolism

S. mansoni contains a full complement of genes required for most core metabolic processes, such as glycolysis, the tricarboxylic acid cycle and the pentose phosphate pathway. However, schistosomes are incapable of de novo synthesis of sterols or free fatty acids and must use complex precursors from the host16. An extensive lipid-carrying protein repertoire could be identified but, although the parasite produces precursors for fatty acid synthesis, no fatty acid synthase could be identified. An inability to use isoprene products of the mevalonate pathway probably accounts for the lack of sterol biosynthesis (Supplementary Table 11 and Supplementary Information). The genes necessary for a complete β-oxidation pathway are present, and this usually inactive pathway might operate in reverse to perform syntheses17. Despite constituting 40% or more of the lipid content of adult worms16, triacylglycerol has an uncertain role in the schistosome’s life cycle: it is slow to turn over, does not contribute to the formation of other lipids16 and its use as an energy store is doubtful17. Nevertheless, S. mansoni possesses lipases capable of breaking down triacylglycerol, so it may have functions other than preventing excessively high concentrations of intracellular fatty acids16. Pathways responsible for synthesizing the phospholipid components of membranes are well represented, except that phosphatidylcholine must be derived from diacylglycerol18 and the parasite must depend on its host as a source of inositol.

GPCRs, ligand-gated and voltage-gated ion channels

GPCRs, ligand-gated and voltage-gated ion channels are targets for 50% of all current pharmaceuticals19. At least 92 putative GPCR-encoding genes are present (Supplementary Table 12), the bulk (82) of which are from the rhodopsin family. The largest groups are the α-subfamily (30), which includes amine receptors, and the β-subfamily (24), which contains neuropeptide and hormone receptors. The diversity of the former subfamily underlines the wide range of potential amine/neurotransmitter reactivities of schistosomes, but the tentative identities assigned need to be confirmed by functional studies, as has already been performed for a histamine receptor20. Schistosomes detect chemosensory cues, but a large, unique clade of the mediating receptors was not found. However, the 26 ‘orphan’ rhodopsin family GPCRs may include proteins with this role. Outside the large rhodopsin family, representatives from each of the smaller families of GPCRs, glutamate family (2), frizzled family (3), and the secretin/adhesion family (4) are present.

Each of the three major ligand-gated ion channel families (the Cys-loop family, glutamate-activated cation channels, and ATP-gated ion channels) is represented in the schistosome genome. Of the 13 Cys-loop family ligand-gated ion channels, nine encode nicotinic acetylcholine receptor subunits (Supplementary Fig. 4 and Supplementary Table 13). The remaining four anion channel subunits group among GABA (γ-aminobutyric acid), glycine and glutamate receptors, but it is not possible to assign precise identities. The seven schistosome glutamate-activated cation channels comprise at least two sequences from each of the three common sub-groupings. The presence of a functional P2X receptor for ATP-mediated signalling in schistosomes was already known21, and the data here show at least four more.

Voltage-gated ion channels generate and control membrane potential in excitable cells, and are central to ionic homeostasis. There are examples of successful drugs targeting voltage-gated sodium, potassium and calcium channels22. Although voltage-gated sodium channels were not found, at least 41 members from each of the major six transmembrane (6TM) and four transmembrane (4TM) families of potassium channels (Supplementary Table 14) are present. The 6TM voltage-gated potassium channel family (20 members) is the largest, including the well-characterized Kv1.1 channel found in nerve and muscle of adult schistosomes23. Other classes of 6TM potassium channels include the KQT channels, large calcium-activated channels, small calcium-activated channels, and cyclic-nucleotide-gated groups. This last group, comprising eight members, is most often associated with signal transduction in primary olfactory and visual sensory cells (Caenorhabditis elegans has only five; ref. 24). S. mansoni possesses six 4TM inward-rectifying TWIK-related potassium channels (about 46 in C. elegans). There are four α and two β subunits of voltage-gated calcium channels in schistosomes, and a β subunit is implicated as a molecular target of the anti-schistosomal praziquantel25.

The kinome

Protein kinases are important regulators of many different cellular functions. Both they and their inhibitors have entered the drug development pipeline in recent years26 but few schistosome kinases have been characterized to date. The S. mansoni genome encodes 249 kinases, including 22 genes with alternative splicing (Supplementary Information). This corresponds to 1.9% of the total coding proteins in the genome, a figure comparable to that found in other species27 (Supplementary Fig. 6). S. mansoni possesses representatives of all of the main kinase groups (Supplementary Fig. 7), the largest of which is the CMGC (cyclin-dependent kinases, mitogen-activated protein kinases, glycogen synthase kinase 3 and CK2-related kinases) group, in contrast to other analysed eukaryotic genomes. However, a single class (RCK) is absent from the CMGC family, a deficiency shared with yeast but not nematodes or mammals.

The least represented groups are the casein kinase (CK1) and receptor guanylate cyclase families with only seven and three members, respectively, contrasting with C. elegans, in which casein kinase is the largest group and receptor guanylate cyclase has 27 members. CK1 (and CMGC) group members that are expressed in sperm or during spermatogenesis in C. elegans are missing in S. mansoni.

The degradome

Proteolytic enzymes (proteases), making up an organism’s ‘degradome’28, operate in virtually every biological and pathological phenomenon29 and are proven drug targets in diverse biomedical contexts30,31. All five major classes of proteases (aspartic, cysteine, metallo-, serine and threonine) are represented as various clans (mechanistically related groups) in the parasite genome (Supplementary Table 17). The percentage distribution of the major clans is generally similar to that of the human host with some notable exceptions, mainly owing to the expansion of constituent protease families in humans. Of the 73 protease families, 61 are found in humans and in S. mansoni, and 60 families are shared. With 335 sequences, proteases comprise 2.5% of the putative proteome (Supplementary Table 18), consistent with the proportion in other organisms (1–5%), but this is only one-third of that in humans (945 sequences, if A2 family retrovirus and retrotransposon proteases are included).

The greatest difference between host and parasite is in the paucity of chymotrypsin-like S1 family enzymes in the latter (22 versus 135 human sequences). This reflects the evolution and diversification of family S1 for complex and highly regulated proteolysis cascades in vertebrates and some invertebrates, such as innate immunity, development, blood coagulation and complement activation32,33,34. From a therapeutic standpoint, the reduced complexity may prove valuable with fewer parasite proteases available for essential life-sustaining functions. For example, robust drug discovery programmes are in place for chymotrypsin-like S1 families35 and peptidase C14 (caspases)36, on which anti-schistosomal drug discovery could ‘piggy-back’37. It is also notable that a smaller number of schistosome protease families (for example, C1, M8 and M13) have more members than the respective families in humans. C1 proteases are involved in nutrient digestion by the parasite, which contrasts with the S1 enzymes used in the host. This disparity has already been exploited for a promising anti-schistosome therapy38. One protease family (C83) is apparently unique to S. mansoni.

Apart from the degradome, but involved in its modulation, 34 protease inhibitors were found (Supplementary Table 19). Most of these are serine protease inhibitors belonging to families I2 (Kunitz-type) and I4 (serpins). Two inhibitors of cysteine proteases (cystatins39,40) and two α-2-macroglobulin homologues (I39) were also identified, as were three inhibitor of apoptosis proteins (I32), one of which is highly expressed in adults, where it may function to regulate one or more of the four schistosome caspases.

Neuropeptides

Thirteen putative neuropeptides were identified (Supplementary Table 20), indicating that schistosomes may have much greater diversity than the two described previously. Apart from the neuropeptide Fs (NPFs), most are apparently restricted to the Platyhelminthes—their absence from humans making them a credible source of anthelmintic drug leads. The predicted product of npp-6 (the amidated heptapeptide AVRLMRLamide) resembles molluscan myomodulin, whereas the two NPP-13 peptides show 100% carboxy-terminal identity with vertebrate neuropeptide-FF-like peptides (peptides ending with a C-terminal sequence PQRFamide); neither of these has previously been reported in any non-vertebrate organism. The discovery of a second NPF (NPP-21b) as well as the known NPP-21a41 is reminiscent of the vertebrate neuropeptide Y (NPY) superfamily, and strengthens the argument that NPFs and NPYs have a common ancestry.

Metabolic chokepoints

A chokepoint analysis of metabolic pathways reconstructed from the S. mansoni genome was used to identify further targets. A total of 607 enzymatic reactions could be placed in pathways, and 120 of these enzymes were identified as chokepoints (Supplementary Table 21). The list of chokepoints includes many that are drug targets in other organisms as well as target reactions already characterized in S. mansoni, validating the approach (Supplementary Information). The list also contains new candidate targets and comprises approximately 1% of the S. mansoni proteome.
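The chokepoint criterion is not spelled out in the main text. A commonly used definition, assumed here, is that a reaction is a chokepoint if it is the only reaction that consumes a particular substrate or the only one that produces a particular product. The following minimal sketch applies that assumed definition to a toy network; the reaction and metabolite names are invented and the function is not the pipeline used for Supplementary Table 21.

```python
from collections import defaultdict


def find_chokepoints(reactions):
    """reactions: {name: (substrates, products)}.

    Returns the set of reactions that are the sole consumer of some
    metabolite or the sole producer of some metabolite.
    """
    consumers = defaultdict(set)
    producers = defaultdict(set)
    for name, (substrates, products) in reactions.items():
        for metabolite in substrates:
            consumers[metabolite].add(name)
        for metabolite in products:
            producers[metabolite].add(name)

    chokepoints = set()
    for users in list(consumers.values()) + list(producers.values()):
        if len(users) == 1:
            chokepoints |= users
    return chokepoints


# Toy network (hypothetical): two routes from A to B, one route from B to C.
toy_network = {
    "R1": (["A"], ["B"]),
    "R2": (["A"], ["B"]),
    "R3": (["B"], ["C"]),
}
print(find_chokepoints(toy_network))  # {'R3'}
```

In this toy network R1 and R2 are redundant with one another, whereas blocking R3 would both strand B and abolish production of C. Some formulations additionally exclude dead-end metabolites, which this sketch does not attempt.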

Chemogenomics screening

In the context of neglected tropical diseases and with constrained investment in drug discovery, piggy-backing37 or ‘drug-repositioning’ strategies42 that re-use existing drugs offer potential time-saving and cost benefits. We adopted a twofold strategy to find significant matches between proteins from the parasite and known ‘druggable’ protein targets of the human host and human-infective pathogens. Using conservative parameters of >50% sequence identity over >80% of the target, we first performed a similarity search against a database of targets curated from medicinal chemistry literature. This revealed 240 distinct S. mansoni transcripts with matches to targets against which there are high quality compounds (Supplementary Table 22). Given the need for short-course, oral therapies against schistosomiasis, this list was further reduced to 94 S. mansoni targets by filtering for potency and predicted bioavailability. A second search, against a database of the targets for human-directed drugs, showed 66 significant matches with pharmaceuticals marketed at present (Supplementary Table 23), corresponding to 34 S. mansoni targets (26, after representing multicopy genes as a single instance; Table 1). For instance, disulfiram, for controlling substance abuse, was highlighted as a potential anti-schistosomal drug; its anti-parasite properties have already been investigated43. Manual inspection of the list for compounds with side effects and toxicity can further refine choices, for example by eliminating the immunosuppressants cyclosporin and rapamycin. The remaining known drugs could be directly tested in animal models, and either applied unmodified in anti-schistosomal therapy, or could serve as leads for further optimisation. Widening the search beyond the initial strict criteria would expand opportunities; for example, topoisomerase 1 is retrieved below our initial threshold, at 71% identity but only 58% overlap.
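The strict matching criteria (>50% identity over >80% of the target) amount to a two-part filter on each similarity hit. As a minimal sketch, assuming each hit has already been reduced to a percent identity and a percent coverage of the target (the record layout, names and all values other than the topoisomerase 1 figures quoted above are hypothetical):

```python
from dataclasses import dataclass

MIN_IDENTITY = 50.0         # percent; threshold quoted in the text
MIN_TARGET_COVERAGE = 80.0  # percent of the target spanned by the alignment


@dataclass(frozen=True)
class Hit:
    parasite_protein: str
    target: str
    percent_identity: float
    percent_target_coverage: float


def passes_strict_filter(hit):
    """True only if both thresholds are exceeded."""
    return (hit.percent_identity > MIN_IDENTITY
            and hit.percent_target_coverage > MIN_TARGET_COVERAGE)


hits = [
    Hit("parasite_protein_1", "druggable target A", 62.0, 91.0),  # invented
    Hit("parasite_protein_2", "topoisomerase 1", 71.0, 58.0),     # from the text
    Hit("parasite_protein_3", "druggable target B", 45.0, 95.0),  # invented
]
kept = [h.target for h in hits if passes_strict_filter(h)]
print(kept)  # ['druggable target A']
```

The topoisomerase 1 match is rejected on coverage alone despite its high identity, which is why relaxing the overlap requirement would recover it.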

Table 1 S. mansoni genes that match a human gene with marketed drugs

Conclusion

A century after Louis Sambon first named the species in 1907 (ref. 44), the sequencing of the S. mansoni genome is a landmark event. The sequence provides the scientific community with several avenues to study this under-researched human pathogen, and will drive future evolutionary, genetic and functional genomic research. Not least, given that just one drug is widely available to treat schistosomiasis at present, the genome sequence, including the genome-mining analysis presented, offers the possibility that new drug candidates will soon be identified.

Methods Summary

Mixed-sex cercariae from the Puerto Rico isolate of S. mansoni45, released from infected Biomphalaria glabrata snails, were placed in low-melting agarose plugs and genomic DNA was prepared by standard methods. Approximately sixfold coverage of the nuclear genome was obtained using a whole-genome shotgun sequencing approach, in which libraries of different cloned insert sizes (in plasmid, fosmid and bacterial artificial chromosome (BAC) vectors) were randomly sequenced by Sanger technology from either end. Sequence reads were assembled, and scaffolds were FISH-mapped to individual chromosomes where possible (Supplementary Table 2). The outputs of several gene prediction algorithms, trained using 409 manually curated gene structures, were integrated into a single set of gene predictions (version 4), which was used for subsequent analyses. Data were accessed from GeneDB (http://www.genedb.org), and Artemis was used for manual annotation and curation of a further 958 genes during subsequent analyses (as described previously46).

Online Methods

Genome sequencing, assembly and mapping

The most commonly used Puerto Rican strain of S. mansoni45 was maintained using albino B. glabrata snails, NMRI mice and golden hamsters (Mesocricetus auratus) as laboratory hosts. Cercariae released from infected snails were resuspended in PBS at a concentration of 5 × 10⁵ cercariae ml⁻¹. The parasites were transferred to a 42 °C water bath, incubated for 5 min, and mixed with an equal volume of 1.2% low-melting point agarose (Gibco-BRL) in PBS at 42 °C. The agarose/cell mixture was transferred to a disposable plug mould (Bio-Rad), placed on ice, and treated twice for 24 h at 50 °C with 1% N-lauroyl sarcosine, 0.5 M EDTA, pH 8.0, 2 mg ml⁻¹ proteinase K (Boehringer Mannheim). Proteinase K was then inhibited by a 30-min treatment with PMSF (40 μg ml⁻¹), followed by three successive 30-min dialyses against 10 mM Tris-HCl, pH 8.0, 0.1 mM EDTA. Sequencing libraries were constructed using genomic DNA extracted from mixed-sex cercariae. Sequencing reads were produced from small insert plasmid clones containing a range of insert sizes. In addition, 12,305 BAC end sequences from S. mansoni BAC library Sm1 (DDBJ/EMBL/GenBank accession numbers BH199420–BH211620), 19,136 CHORI 103 BAC end sequences (DDBJ/EMBL/GenBank accession numbers DX983724–ED003998) and 16,628 fosmid end reads were included. After filtering out low quality reads, 85% of the remaining 3.19 million reads were assembled using Phusion47 into 381 Mb, or 363 Mb after filtering small (<2 kb) contigs (see Supplementary Table 1). The size is greater than the previously estimated size of 270 Mb48, although this size estimate can be revised to 300 Mb because the original measurements were made using the E. coli genome as a control, which has a length that is 10% greater than previously thought. From the assembly, a depth of coverage of sixfold was calculated.

A physical map was generated using FISH to localize S. mansoni BACs to the seven autosomal and sex pairs of chromosomes using previously published methods49. Clones from two BAC libraries, Sm1 (ref. 50) and CHORI-103 (http://bacpac.chori.org/schis103.htm), each constructed from cercarial DNA were randomly picked and subjected to FISH analysis. Owing to the repetitive nature of the schistosome genome, BACs would often hybridize to more than one chromosome. This was in spite of using sheared genomic DNA to block the repetitive sequences. Of the 500 clones analysed, 334 showed unique hybridization patterns (Fig. 1, Supplementary Fig. 1 and Supplementary Table 2). A total of 118 BACs that were FISH-mapped were among those end-sequenced, and 153 scaffolds were assigned to a specific chromosome.

Retroelements analysis

We performed an iterative search of retroelements using the conserved reverse transcriptase domain as previously described51. Elements with higher than 80% nucleotide identity in the reverse-transcriptase region were considered as members of the same family. To obtain an unbiased estimate of abundance for each element in the genome, all the identified families were mapped to the shotgun reads using BLASTN52. The number of bases spanned by the alignment for each element was counted and compared with the total number of bases in the shotgun data to determine their representation in the S. mansoni genome.
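The abundance estimate described above reduces to summing, for each family, the read bases spanned by its alignments and dividing by the total bases in the shotgun data. The sketch below illustrates that calculation under stated assumptions: alignments are supplied as (family, read, start, end) tuples with 1-based inclusive coordinates, overlapping alignments to the same read are merged so that no base is counted twice (a choice made here, not one stated in the text), and all names and toy coordinates are hypothetical.

```python
from collections import defaultdict


def merged_length(intervals):
    """Total bases covered by 1-based inclusive (start, end) intervals,
    counting overlapping or adjacent stretches only once."""
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end + 1:
            if current_end is not None:
                covered += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start + 1
    return covered


def family_representation(alignments, total_read_bases):
    """alignments: iterable of (family, read_id, start, end) on the read.
    Returns {family: fraction of all shotgun bases spanned}."""
    per_read = defaultdict(list)
    for family, read_id, start, end in alignments:
        # Reverse-strand hits may be reported with start > end.
        per_read[(family, read_id)].append((min(start, end), max(start, end)))
    bases = defaultdict(int)
    for (family, _read), intervals in per_read.items():
        bases[family] += merged_length(intervals)
    return {family: n / total_read_bases for family, n in bases.items()}


# Toy alignments (invented coordinates) against 1,000 bases of reads.
toy_alignments = [
    ("SR2", "read1", 1, 100),
    ("SR2", "read1", 51, 150),
    ("SR2", "read2", 200, 101),
    ("Perere-3", "read3", 1, 50),
]
representation = family_representation(toy_alignments, total_read_bases=1000)
print(representation)
```

Working from unassembled reads rather than the assembly avoids underestimating families whose copies collapse during assembly.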

Genome annotation and repeat content analysis

A training set (for ab initio gene finding) of 409 genes was manually curated from S. mansoni sequences already within the Uniprot database and from manual prediction of highly conserved genes. Further genome-wide gene predictions were made using both EVidenceModeller and PASA53. EVidenceModeller uses an evidence-combining strategy to compute an optimal set of protein-coding gene structures derived from several, often conflicting, sources of gene predictions. The sources of evidence for our annotation of the S. mansoni genome included the following: ab initio gene predictions derived from GlimmerHMM54, TWINSCAN55, and Augustus56; protein sequence homologies to a non-redundant protein database using AAT57; cross-genome sequence homologies between S. mansoni and S. japonicum using PROmer58; spliced genome alignments to ESTs using GMAP59; and repeat regions identified using RepeatScout60 and RepeatMasker (A. F. A. Smit, R. Hubley and P. Green, unpublished observations, http://www.repeatmasker.org). Consensus gene predictions generated by EVidenceModeller were further modified to include annotations of untranslated regions and alternative splicing isoforms for 1,038 genes by applying PASA to the earlier GMAP-aligned ESTs. A total of 13,197 transcripts were predicted for 11,809 genes. Of the 30,110 previously described EST clusters, 24,373 map to contigs >1 kb in the current genome assembly. The true number of genes could therefore be as high as 17,500. By parsing BLAST description lines, putative products were assigned to each gene. During the course of subsequent analyses, 958 of these were manually edited using the Artemis annotation tool.

For an unusually large gene, encoding a putative ryanodine receptor spanning 164 kb, 79 of its 93 intron–exon boundaries were confirmed by RT–PCR. Approximately 45% of the S. mansoni genome was found to be repetitive, computed by summing genomic bases matching known S. mansoni mobile element sequences or repeat family consensus sequences derived from the RepeatScout de novo repeat library. The repeat content was also assessed on the basis of the distribution of random sequences 25 nucleotides in length; by this measure, 104,028,213 out of 373,600,457 bases (28%) were repetitive. Note that this value is significantly lower than that obtained with RepeatMasker because the latter allows sequence divergence of up to 20%.
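The exact-match estimate can be sketched as follows. The text does not specify how the 25-nucleotide sequences were sampled or scored, so this sketch makes a loud assumption: a base is called repetitive if it lies within any 25-mer that occurs more than once in the sequence. Reverse-complement matches are ignored for brevity, and the function name is hypothetical.

```python
from collections import Counter


def repetitive_fraction(seq, k=25):
    """Fraction of bases covered by at least one k-mer seen more than once.

    Assumption: exact forward-strand matches only (no divergence allowed),
    which is why such an estimate sits below a divergence-tolerant one.
    """
    seq = seq.upper()
    n_kmers = len(seq) - k + 1
    if n_kmers <= 0:
        return 0.0

    counts = Counter(seq[i:i + k] for i in range(n_kmers))

    covered = [False] * len(seq)
    for i in range(n_kmers):
        if counts[seq[i:i + k]] > 1:
            for j in range(i, i + k):
                covered[j] = True
    return sum(covered) / len(seq)


# Toy check with k=4: "ACGT" occurs at positions 0 and 4, covering 8 of 13 bases.
toy_fraction = repetitive_fraction("ACGTACGTTTGCA", k=4)

# The figures quoted in the text: about 0.278, reported as 28%.
reported_fraction = 104_028_213 / 373_600_457
```

Because only identical 25-mers count, diverged repeat copies are missed, which is consistent with the lower figure relative to the RepeatMasker-based estimate.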

Analysis of putative transcription factors

Profile hidden Markov models (HMMs) of domains present in the proteins that constitute the TRANSFAC eukaryotic transcriptional factor database61 and the DBD DNA-binding domain database62 were used to search the genome of S. mansoni in conjunction with 63 other eukaryotic genomes. The score threshold was defined as the lowest pairwise score among all members of the Pfam family associated with the HMM. The putative transcriptional activator proteins were then clustered on the basis of sequence similarities (BLASTP E value ≤ 10^-6 considered significant) using the TRIBE-MCL algorithm63 and an inflation value of 2.0 (ref. 64).

Micro-exon genes

MEGs were predicted as previously described9 with further manual refinement using available S. mansoni EST data (including both published data65 and unpublished data from GenBank/dbEST or ftp://ftp.sanger.ac.uk/pub/pathogens/Schistosoma/mansoni/ESTs/). Further family members were identified by similarity searches against the available supercontigs in the assembly with long flanking MEG exons as query sequences. Signal peptides and transmembrane domains were detected using SignalP66 and TMHMM 2.0 (ref. 67) programs, respectively.

Expression of MEG families at different stages throughout the life cycle was analysed by BLAST searching the sequences of all members of a family against the complete S. mansoni EST data set, which comprises ESTs from the following developmental stages: germball (28,497), cercaria (21,639), 3-day somule (6,122), 7-day somule (41,043), 21-day liver worm (6,044), 28-day liver worm (11,227), 45-day adult worm (59,552), egg (33,674) and miracidium (19,982).

Evolutionary analysis

To identify orthologues and paralogues of S. mansoni genes, we built a standalone version of the TreeFam database (version 7) of animal gene families68,69. For each S. mansoni predicted protein, we identified the top-matching TreeFam ‘clean’ family using HMMER70 (with E ≤ 10^-10 as a cutoff). Similarly, the top-matching family was identified for each Nematostella vectensis (release version 1.0)71 and S. japonicum protein. Trees and alignments were built for the families as for the standard TreeFam pipeline. This resulted in trees for 5,829 families that contain S. mansoni, S. japonicum or N. vectensis genes. From these trees, we identified within-species paralogues in the three species, and identified the ancestral taxon in which the duplication that gave rise to each pair of paralogues occurred.
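One common way to date a duplication is to assign it to the last common ancestor of all species whose genes descend from the duplication node. The sketch below illustrates that idea only; it is not taken from the TreeFam pipeline, and the abbreviated lineages, the dictionary and the function name are assumptions introduced for illustration.

```python
# Simplified, root-to-tip lineages for the three species (assumed, abbreviated).
LINEAGES = {
    "S. mansoni": ["Eumetazoa", "Bilateria", "Platyhelminthes",
                   "Schistosoma", "S. mansoni"],
    "S. japonicum": ["Eumetazoa", "Bilateria", "Platyhelminthes",
                     "Schistosoma", "S. japonicum"],
    "N. vectensis": ["Eumetazoa", "Cnidaria", "N. vectensis"],
}


def duplication_taxon(species_below_node):
    """Return the deepest taxon shared by every species under a duplication node.

    species_below_node: species names of all genes descending from the node.
    Returns None if no species are given.
    """
    lineages = [LINEAGES[s] for s in set(species_below_node)]
    last_shared = None
    # Walk the lineages in parallel from the root until they diverge.
    for ranks in zip(*lineages):
        if len(set(ranks)) == 1:
            last_shared = ranks[0]
        else:
            break
    return last_shared


species_specific = duplication_taxon(["S. mansoni", "S. mansoni"])
genus_level = duplication_taxon(["S. mansoni", "S. japonicum"])
ancient = duplication_taxon(["S. mansoni", "S. japonicum", "N. vectensis"])
```

Under this scheme a pair of paralogues whose duplication node contains only S. mansoni genes is a lineage-specific expansion, whereas a node that also covers N. vectensis genes predates the split between cnidarians and bilaterians.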

Kinome

A eukaryotic protein kinase domain HMM was built from a manually adjusted alignment of 68 diverse kinase domain sequences from yeast, worm, fly and human that share <50% sequence identity in the catalytic domain. To test the selectivity of the model, it was run against the UniProt database. Using a P < 0.1 cutoff, the model detected 2,688 putative domains, all of which were annotated either as kinases or putative kinases in different description fields. Local and global HMM models were built with the HMMER package (http://hmmer.janelia.org/) from several sequence alignments generated by MAFFT software72 and were used for sensitive searches against the S. mansoni database.

Identified genes were annotated using Artemis, integrating data from Interproscan and Reverse PSI-BLAST searches73, and the size of the S. mansoni kinome was compared with those of Plasmodium falciparum74, Homo sapiens27, Trypanosoma cruzi75, Trypanosoma brucei76, C. elegans77, Leishmania major78 and Mus musculus79. A dendrogram was constructed from the kinase domains of the identified proteins with the CLC Main Workbench (CLC bio) using the neighbour-joining method with 1,000 replicates.

Identification of putative proteases and inhibitors

We used the MEROPS database80 (http://merops.sanger.ac.uk) to identify active S. mansoni proteases and protease inhibitor homologues, using BLASTP52,73 with E ≤ 10^-9 as a cutoff. More distant relatives were identified through HMMER version 2.3.2 (ref. 70) searches of Pfam models81 that correspond to MEROPS families (Pfam version 22.0 (ref. 82), http://pfam.sanger.ac.uk/), using the same E-value cutoff. After removal of predicted proteases <80 residues in length and provisional inhibitors <50 residues long, this initial data set contained 656 provisional homologues. A secondary screen against the NCBI non-redundant protein database retained a total of 369 S. mansoni sequences, which overlapped in at least 50% of their MEROPS hit or Pfam domain with an experimentally characterized protease or inhibitor homologue. False positives were removed by comparing nonspecific MEROPS description lines (for example, ‘non-peptidase homologues’) to the top non-redundant BLAST hits with an E-value at least 3 logs greater than the top MEROPS or Pfam hit but lacking associated experimental validation. This approach removed MEROPS proteins that are not functional proteases but are structurally related (such as hormone-sensitive lipases in the family S9; flagged as homologues of proteins that are inactive protease homologues in Supplementary Table 18). Similarly, matches to Pfam models of domains found not only in proteases and inhibitors but also in a wide range of other proteins (for example, PF00047, PF00059, PF00561, PF01476 and PF0764) were removed as false positives in the absence of further evidence.

We next predicted which of the putative protease homologues were likely to be active or inactive. BLAST alignments of proteins against putative homologues classified in MEROPS predicted active site positions and residues in the S. mansoni query sequence, followed by manual inspection of sequence alignments to refine active site residue predictions. In a few cases, in which an acceptable alignment was not produced by BLASTP of MEROPS, a non-redundant sequence was used. In more difficult cases, involving two closely related S. mansoni sequences, active site residues were identified from multiple alignments of S. mansoni sequences, a representative sequence for the corresponding MEROPS family, and the seed alignment sequences for the relevant Pfam model.

Metabolic chokepoint analysis

An S. mansoni metabolic pathways database, SchistoCyc (http://schistocyc.schistodb.net/ptools), was created using the Pathway Tools software83, which contains algorithms to predict an organism’s metabolic pathways from its genome by comparison to MetaCyc, a reference pathways database84. From the pathway database, potential chokepoint reactions85 were identified (those that uniquely consume a specific substrate or produce a specific product). Chokepoint reactions are probably critical to normal cellular physiology and therefore represent potential drug targets.
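The chokepoint definition given above (a reaction that is the only consumer of some substrate or the only producer of some product) can be sketched directly. The toy network, metabolite labels and function name below are hypothetical, and the sketch treats every reaction as irreversible; it is not the procedure implemented in Pathway Tools.

```python
from collections import defaultdict


def find_chokepoints(reactions):
    """Return the set of reactions that uniquely consume or uniquely produce a metabolite.

    reactions: {reaction_name: (substrates, products)}, each an iterable of
    metabolite names. Reactions are assumed irreversible (a simplification).
    """
    consumers = defaultdict(set)
    producers = defaultdict(set)
    for name, (substrates, products) in reactions.items():
        for metabolite in substrates:
            consumers[metabolite].add(name)
        for metabolite in products:
            producers[metabolite].add(name)

    chokepoints = set()
    for reaction_sets in (consumers.values(), producers.values()):
        for names in reaction_sets:
            if len(names) == 1:
                chokepoints |= names
    return chokepoints


# Hypothetical toy network: A is converted to B by two routes; only R3 uses B.
toy_network = {
    "R1": (["A"], ["B"]),
    "R2": (["A"], ["B"]),
    "R3": (["B"], ["C"]),
}
toy_chokepoints = find_chokepoints(toy_network)
```

In the toy network R1 and R2 are redundant with each other, so blocking either leaves B available, whereas R3 is the sole consumer of B and the sole producer of C. A fuller treatment would also need to handle reversible reactions and dead-end metabolites, which this sketch does not attempt.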

Chemogenomics

To identify, in S. mansoni, putative proteins for which therapeutic compounds or high-quality chemical tools may already be available, sequence similarity searching was performed using BLASTP against the Target Dictionary from DrugStore19 (a database of Food and Drug Administration approved drugs) and StARlite (a database of Structure Activity Relationship data abstracted and indexed manually from the primary literature, at present containing 440,055 unique compounds directed against approximately 3,500 distinct molecular targets). The results were stringently filtered for significance: ≥50% identity, ≥80% overlap of the target and a BLAST E ≤ 10^-10. To prioritise the 755 hits to StARlite, we applied filters for potency/affinity against the matched target, combined with an estimate of the likelihood that the compound could be orally absorbed. The potency cutoff was set at a half-maximal inhibitory concentration (IC50), inhibition constant (Ki), or dissociation constant (Kd) of 100 nM or better, and oral bioavailability was estimated using the ‘rule of five’ (molecular weight of no more than 500 Da, clogP less than five, no more than five hydrogen bond donors and no more than ten nitrogen or oxygen atoms)86. The drugs associated with matches in the DrugStore database were classified according to a broad range of current therapeutic categories, and each drug-target association was assigned one of three levels of evidence: (1) direct and clear evidence that this interaction is primarily responsible for the therapeutic action of the drug; (2) direct and clear evidence that this interaction is one mechanism for the drug but other targets or mechanisms may exist; and (3) indirect or inferred evidence of the association of the drug, target and therapeutic action.
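The two filtering stages can be expressed as simple predicates. This is a sketch under stated assumptions: the record types and field names are hypothetical, the E-value criterion is read as an upper bound (E ≤ 10^-10, the usual sense of a stringent cutoff), and "100 nM or better" is read as a potency value of at most 100 nM.

```python
from dataclasses import dataclass


@dataclass
class BlastHit:
    identity_pct: float        # percent identity of the alignment
    target_overlap_pct: float  # percent of the database target covered
    evalue: float              # BLAST E value


@dataclass
class Compound:
    mol_weight: float   # Da
    clogp: float        # calculated logP
    h_donors: int       # hydrogen bond donors
    n_o_atoms: int      # nitrogen plus oxygen atoms
    potency_nm: float   # best IC50, Ki or Kd against the matched target, in nM


def is_significant(hit):
    """Significance filter: >=50% identity, >=80% target overlap, E <= 1e-10."""
    return (hit.identity_pct >= 50
            and hit.target_overlap_pct >= 80
            and hit.evalue <= 1e-10)


def passes_rule_of_five(c):
    """Rule-of-five thresholds as stated in the text."""
    return (c.mol_weight <= 500
            and c.clogp < 5
            and c.h_donors <= 5
            and c.n_o_atoms <= 10)


def is_prioritised(c):
    """Potent (100 nM or better) and predicted to be orally absorbed."""
    return c.potency_nm <= 100 and passes_rule_of_five(c)


# Hypothetical records, for illustration only.
good_hit = BlastHit(identity_pct=62.0, target_overlap_pct=91.0, evalue=1e-40)
weak_hit = BlastHit(identity_pct=62.0, target_overlap_pct=91.0, evalue=1e-5)
lead = Compound(mol_weight=412.5, clogp=3.1, h_donors=2, n_o_atoms=7, potency_nm=35.0)
too_weak = Compound(mol_weight=412.5, clogp=3.1, h_donors=2, n_o_atoms=7, potency_nm=250.0)
too_heavy = Compound(mol_weight=612.0, clogp=3.1, h_donors=2, n_o_atoms=7, potency_nm=35.0)
```

Keeping the similarity filter separate from the compound filter mirrors the two steps in the text: the first selects S. mansoni proteins with a close, well-covered match to a known target, and the second ranks the compounds attached to that target.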