Main

Studies addressing the molecular determinants of bacterial pathogenicity towards plants have concentrated on a limited number of bacterial species, representing the taxonomic diversity of principal Gram-negative plant pathogens that produce the most common diseases. Among these, Ralstonia solanacearum1,2 has unique and significant features. It is a soil-borne pathogen that naturally infects roots. It exhibits a strong and tissue-specific tropism within the host, specifically invading, and multiplying extensively in, the xylem vessels. In addition, with over 200 host species belonging to more than 50 botanical families3, this bacterium has an unusually wide host range, and offers a unique opportunity for the analysis of a strong and generalized array of virulence factors. Ralstonia solanacearum has been studied intensively both biochemically and genetically, and has long been recognized as a model system for the analysis of pathogenicity4. It is well adapted to life in soil in the absence of host plants5, thereby providing a good system to investigate functions governing adaptation to such an ecological niche. Ralstonia solanacearum is a β-proteobacterium and thus belongs to a group of bacteria whose genomic organization is still poorly characterized. The only other representative of this group for which the complete genome sequence has been determined and annotated is Neisseria meningitidis6,7.

To date, the genome of only one plant pathogenic bacterium, Xylella fastidiosa, has been sequenced completely8. Here we report the complete nucleotide sequence of the genome of R. solanacearum strain GMI1000, a race 1 strain isolated from tomato. Notably, GMI1000 is pathogenic on the model plant Arabidopsis thaliana, the genome of which has been entirely sequenced9, thereby facilitating studies of host response. We provide an integrative analysis of the predicted functions encoded in this organism with special emphasis on pathogenicity. Our analysis of the genome sequence provides clues to the evolution of pathogenicity functions. The genome sequence of R. solanacearum is a first step towards an exhaustive functional analysis of pathogenicity determinants in this pathogen.

A bipartite genome structure

After sequencing by the whole-genome random sequencing method, we assembled the R. solanacearum genome into two circular molecules: a large replicon of 3,716,413 bp and a smaller 2,094,509-bp replicon, yielding a total genome size of 5,810,922 bp. The two molecules have an almost identical G+C content (67.04% and 66.86% for the large and small replicon, respectively). These two molecules correspond to the two bands that are visualized on pulsed-field gel electrophoresis (see Supplementary Information Fig. 1). The genome encodes a total of 5,129 predicted proteins, and the general features are shown in Table 1.
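As a quick arithmetic check on these figures, the short sketch below recomputes the total genome size and a length-weighted G+C content from the two replicon values quoted above. The variable names are illustrative only and are not taken from the original analysis.

```python
# Replicon sizes (bp) and G+C percentages as quoted in the text.
replicons = {
    "large replicon": (3_716_413, 67.04),
    "small replicon": (2_094_509, 66.86),
}

# Total genome size is the sum of the two circular molecules.
total_bp = sum(size for size, _ in replicons.values())

# Genome-wide G+C is the length-weighted mean of the per-replicon values.
weighted_gc = sum(size * gc for size, gc in replicons.values()) / total_bp

print(f"total size: {total_bp:,} bp")
print(f"weighted G+C: {weighted_gc:.2f}%")
```

The sum reproduces the stated total of 5,810,922 bp, and the weighted G+C comes out at about 67%, in line with the genome-wide average quoted later in the text.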

Table 1 General features of the R. solanacearum strain GMI1000 genome

Analysis of the sequence establishes that the smaller replicon corresponds to the previously described megaplasmid, as it encodes hrp genes that were previously localized on this replicon10. Such a bipartite genome structure seems to be a characteristic of R. solanacearum, as a megaplasmid has been detected in most of the strains belonging to this species11. As no derivative of strain GMI1000 in which the megaplasmid is deleted has been obtained, the status of this replicon as a plasmid is still an open question. The distribution of genes between the two replicons provides some insights into this question.

The origin of replication of the large replicon, as identified by G+C skew analysis12, has features that are characteristic of a chromosomal origin of replication in bacteria (see Supplementary Information Fig. 2). It is flanked by the rnpA gene on one side, and the dnaA, dnaN, gyrAB genes on the other. It also harbours a single consensus DnaA-binding box (TTATCCACAAA), the first nucleotide of which was arbitrarily chosen as the origin for the nucleotide numbering of the large replicon. Furthermore, the large replicon encodes a complete set of essential housekeeping genes including genes required for: (1) DNA replication, DNA repair and cell division; (2) transcription; and (3) translation. This last set includes all the ribosomal proteins, 3 complete ribosomal DNA loci, and 55 identified transfer RNAs, allowing recognition of all possible codons. Finally, all the essential genes required for purine and pyrimidine biosynthesis are located on the large replicon. Therefore, this replicon encodes all of the basic mechanisms required for the survival of the bacterium, and is referred to here as the ‘chromosome’. Accordingly, the smaller replicon may be a dispensable genetic element. The putative origin of replication of this replicon, as predicted by G+C skew analysis, has characteristics of plasmid-borne ori loci. It is flanked by the repA gene and by at least 14 repetitions of a conserved motif (consensus G/CCGTACCCG/ATTTCTGCG) that may be RepA-binding boxes. Therefore, the smaller replicon appears to be a megaplasmid. The first nucleotide of the most upstream RepA-binding box located in the intergenic region preceding repA has been arbitrarily chosen as the origin for nucleotide numbering of the megaplasmid.
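The two sequence-based steps described above, locating a putative origin by G+C skew and then looking for a DnaA-binding box, can be sketched as follows. This is a minimal illustration and not necessarily the procedure of ref. 12: it assumes the common convention that the minimum of the cumulative (G−C)/(G+C) skew marks the likely origin, it uses non-overlapping windows, and the function names are hypothetical.

```python
def cumulative_gc_skew(seq, window=1000):
    """Return (window end positions, cumulative GC skew) over non-overlapping windows."""
    seq = seq.upper()
    positions, cumulative = [], []
    running = 0.0
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        # Skew of one window; windows without G or C contribute nothing.
        running += (g - c) / (g + c) if (g + c) else 0.0
        positions.append(start + window)
        cumulative.append(running)
    return positions, cumulative


def predicted_origin(seq, window=1000):
    """Position where the cumulative skew is lowest (assumed origin convention)."""
    positions, cumulative = cumulative_gc_skew(seq, window)
    if not cumulative:
        raise ValueError("sequence is shorter than one window")
    lowest = min(range(len(cumulative)), key=cumulative.__getitem__)
    return positions[lowest]


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA string."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_motif(seq, motif):
    """Return sorted (0-based start, strand) hits of a motif on both strands."""
    seq = seq.upper()
    hits = []
    for strand, pattern in (("+", motif.upper()), ("-", reverse_complement(motif))):
        start = seq.find(pattern)
        while start != -1:
            hits.append((start, strand))
            start = seq.find(pattern, start + 1)
    return sorted(hits)


# Toy sequence: a C-rich half followed by a G-rich half.
toy = "CCAT" * 25 + "GGAT" * 25
print(predicted_origin(toy, window=20))

# DnaA-box consensus quoted in the text, searched on both strands.
print(find_motif("AAAATTATCCACAAAGGGG", "TTATCCACAAA"))
```

On a real circular replicon the scan would also need to handle the wrap-around at the end of the sequence, which this sketch ignores.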

The megaplasmid carries several metabolically essential genes that are also present on the chromosome. These include a complete copy of a rDNA locus with 2 tRNA genes, a gene coding for the α-subunit of DNA polymerase III, and a gene for the protein elongation factor G. Moreover, several enzymes controlling primary metabolism, including amino acid and cofactor biosynthesis, are encoded on the megaplasmid with no counterpart on the chromosome. As a consequence, we predict that a megaplasmid-deleted derivative of strain GMI1000 will be auxotrophic for several metabolites, a status similar to that reported for certain internal deletion mutants of the megaplasmid13.

Analysis of the genes present on the megaplasmid suggests that this replicon has a significant function in overall fitness and adaptation of the bacterium to various environmental conditions. As mentioned previously, the megaplasmid carries all of the hrp genes that are required to cause disease on plants, a trait that allows the bacterium to colonize a rather exclusive ecological niche. The megaplasmid also encodes the constituents of the flagellum and most of the genes governing exopolysaccharide synthesis. In addition, this replicon carries 315 out of the 748 genes of unknown function, a proportion significantly biased in favour of the megaplasmid (P = 4 × 10⁻⁷). On the other hand, a significant bias in favour of the chromosome is observed for the R. solanacearum genes that are shared with other bacteria (see Supplementary Information Fig. 3).

Mosaic structure of the genome

The gene prediction software FrameD (http://www.toulouse.inra.fr/FrameD.html), using the probabilistic model constructed on previously characterized R. solanacearum genes, led to a clear-cut prediction of genes in more than 90% of the genome. Using this matrix, a significant portion of the genome (7%) was predicted as non-coding (in regions spanning over 1 kb) although in some instances, BLASTX analysis revealed significant similarities in these regions with known proteins. On the basis of these homologies, an alternative matrix was constructed and used for gene prediction in such regions that we designated alternative codon-usage regions (ACURs). When analysed for base composition, most ACURs, but not all, differ significantly from the average 67% G+C content found for the entire genome, with variations ranging from 50% to 70% G+C content. In addition, codon usage in these regions differs significantly from codon usage in the rest of the genome (Supplementary Information Table 1 and Fig. 4). Furthermore, ACURs were often associated with mobile genetic elements. In 44 out of 93 ACURs, a prophage, insertion sequence or part of an insertion sequence occurred either encoded directly within the ACUR or within the 1-kb flanking region. The strongly biased distribution of genetically mobile elements with ACURs (P = 1.7 × 10⁻⁵) suggests that ACURs may have been acquired through horizontal gene transfer, consistent with the propensity of R. solanacearum to take up and recombine exogenous DNA through natural transformation14.
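One simple way to see how codon usage in a candidate region departs from a reference set of genes is to compare codon frequency tables. The sketch below is an illustration only: it is not the Markov model used by FrameD, and the distance measure (total variation distance, ranging from 0 for identical usage to 1 for no shared codons) and the function names are assumptions made for this example.

```python
from collections import Counter


def codon_frequencies(cds):
    """Relative frequency of each codon in an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: n / total for codon, n in counts.items()}


def usage_distance(freq_a, freq_b):
    """Total variation distance between two codon frequency tables."""
    codons = set(freq_a) | set(freq_b)
    return 0.5 * sum(abs(freq_a.get(c, 0.0) - freq_b.get(c, 0.0)) for c in codons)


# Toy comparison: alanine encoded with G/C-ending versus A-ending codons.
reference = codon_frequencies("GCCGCGGCCGCG")
candidate = codon_frequencies("GCAGCAGCAGCA")
print(usage_distance(reference, candidate))
```

In practice the reference table would be built from many well-characterized genes, and short regions give noisy frequency estimates, so any threshold on the distance would need to be calibrated.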

In addition to ACURs, strain GMI1000 harbours several other elements that may have a function in genetic instability and rapid evolution of the genome (Fig. 1). Throughout the genome, there are at least 118 copies of complete or truncated insertion sequences representing 17 distinct elements belonging to 7 families. Three of these insertion sequences have been identified previously in R. solanacearum (ISRso1, IS1421 and IS1021) (http://www-IS.biotoul.fr) and we have called the remaining 14 ISRso5–ISRso18. There is a preponderance of IS3 (22 copies) and IS5 (27 copies) family members. In addition, there are two, presumably non-autonomous, derivatives of ISRso1, composed uniquely of the terminal sequences separated by an identical 117-bp DNA segment. This is similar to the RUP elements observed in Streptococcus pneumoniae15. At least 4 possibly defective prophages were found on the chromosome together with a conjugative transposon located between positions 2,781,738–2,825,808 (Fig. 1). This transposon is related to the 55-kb transposon Tn4371 from Ralstonia metallidurans16. The R. solanacearum conjugative transposon includes a set of tra and trb genes for conjugation as well as an integrase (RS00926); however, the genes coding for biphenyl resistance in Tn4371 are replaced by a group of genes with undefined functions. There are several loci encoding Rhs and Vgr-related elements17,18, found to be recombinational hotspots in Escherichia coli. Of note, ACURs and mobile genetic elements are not distributed evenly on the genome but are often clustered on both replicons (Figs 1 and 2).

Figure 1: General organization of the R. solanacearum strain GMI1000 genome.

The two circular replicons are represented linearly starting from base number 1 (a, chromosome; b, megaplasmid). Kilobases are indicated along the top. For each replicon the distribution of protein-coding genes (line 1), insertion sequences (line 2), ORFs from other genetically mobile elements (line 4) and ACURs (line 3) are represented. The percentage of G+C variation (from the average; red line) along the length of the genome using a 2,500-bp window is shown (line 5). Colours on line 1 correspond to the main classes of genes (the colour code is available at http://sequence.toulouse.inra.fr/R.solanacearum.html). Green triangles point towards the positions of rDNA loci; red stars locate the major bacteriophage remnants; the green star corresponds to the position of the conjugative transposon. The yellow circle indicates the position of the hrp locus and the blue circle indicates the tandem duplication of a 31-kb region. The blue box in a delineates the region of the genome enlarged in Fig. 2.

Figure 2: Enlarged view of part of the chromosome.

This figure reveals the mosaic structure of the genome. ACURs (narrow yellow rectangles) alternate with genes from mobile genetic elements (large red rectangles with insertion sequences shown in striped white and red rectangles) and with genes that fit the standard Markov model for R. solanacearum genes. An Ala-tRNA gene is at the end of the last ACUR. This region contains genes encoding two candidate type-III secreted effectors (RS02429 and RS02460), three haemagglutinin-related proteins (RS00436, RS02405 and RS5936) and a Vgr-related protein (RS02437 and RS02439) inactivated by insertion of ISRso1. Colour codes for other genes are given at the bottom of the figure and correspond to the main classes defined in the nomenclature of ref. 46.

The R. solanacearum sequence has a mosaic structure containing numerous elements signalling the potential for evolution. Genomic rearrangements have already been reported to occur naturally in this bacterium13,19 and are further exemplified by the almost perfect tandem duplication of a 31-kb stretch of DNA on the megaplasmid (positions 1,648,630–1,710,832). The present genome sequence may represent a single snapshot of a structure that is variable from isolate to isolate and within derivatives from the same isolate.

Candidate genes responsible for pathogenesis

Apart from the virulence genes already described in R. solanacearum, a series of new genes putatively involved in pathogenicity were identified (Table 2; see Supplementary Information Table 2 for a complete list). These include genes coding for additional hydrolytic enzymes involved in the degradation of plant cell walls, and genes required for the production of the plant hormones auxin and ethylene or for the degradation of the plant signalling molecules ethylene and salicylic acid—a mediator of the plant systemic acquired resistance. Ten genes were predicted to code for proteins involved in resistance to oxidative stress. These 10 genes may be involved in the detoxification of the active oxygen species produced by infected plants, molecules reportedly representing a first line of defence against pathogen invasion20. Genes involved in the production of toxins or antibiotics were also identified: these include six haemolysin-like genes belonging to the RTX toxin family21 and several peptide or polyketide synthase genes. In particular, the two largest open reading frames (ORFs) in the genome (RS05859 and RS05860, coding for 5,953 and 6,889 amino-acid products, respectively) are highly related to the syringomycin synthase gene, which is required for the production of a Pseudomonas syringae toxin22.

Table 2 Known and candidate genes responsible for pathogenesis in the genome of R. solanacearum strain GMI1000

An unusual characteristic of the R. solanacearum genome is that it contains a large number of genes coding for outer-membrane proteins or components of bacterial appendages (pili, fimbriae) implicated in the attachment of the bacterium to external surfaces. We found at least 35 genes, distributed in 5 gene clusters, involved in the biogenesis of a type IV pilus. Type IV pili are known from several other bacterial systems to be adhesion factors, and are responsible for movement of bacteria over epithelial surfaces without the use of flagella23. Furthermore, two other gene clusters are predicted to encode an unusual type of pilus structure that mediates a tight adherence to surfaces similar to those reported recently in Caulobacter crescentus24 and Actinobacillus actinomycetemcomitans25. For all of these pili/fimbriae-coding systems, we found multiple copies of the pilin structural genes. This raises the possibility that these genes are expressed in different contexts or may have slightly different structural roles, thereby broadening the adaptative ability of this wide-host-range pathogen to interact with diverse environmental substrates, including host epidermal surfaces.

Another peculiarity concerning the abundance of adhesion/attachment functions in R. solanacearum is exemplified by a class of surface molecules encoded by long ORFs (9 frames with a coding potential greater than 2,500 amino acids). These translated products share homology with proteins that are adhesins in other bacterial pathogens, the filamentous haemagglutinin (FhaB) of Bordetella pertussis and the HMW1A/HMW2A adhesins of Haemophilus influenzae26. In total, there are 14 probable haemagglutinin-type proteins, and 13 additional ORFs coding for proteins containing variable internal repeats that are structurally related to filamentous haemagglutinins. Ralstonia solanacearum therefore has the greatest number of these haemagglutinin-related proteins of all the completed bacterial genomes.

We have also identified a family of related proteins presenting some degree of similarity to the Agrobacterium tumefaciens proteins AttM and AttZ, both of which are required for attachment to plant cells and for virulence27. The R. solanacearum genome is thus rich in attachment factors, perhaps functioning as determinants for a wide host range.

Effectors dependent on a type III secretion system

Ralstonia solanacearum possesses a cluster of hrp genes encoding a type III secretion system (TTSS) that is essential for pathogenicity. Bacterial TTSSs are conserved among both plant and animal pathogens and translocate effector proteins into the cytoplasm of the host cell4,28. Identification of translocated effectors and establishment of their mode of action after delivery into host cells are two chief challenges, the solution of which has high potential for the conception of new therapeutic strategies.

A principal outcome of the analysis of the R. solanacearum genome is the discovery of multiple genes related to the avirulence (avr) genes described in several plant pathogenic bacteria. These avr genes encode effector proteins presumably injected into host cells through the hrp-encoded TTSS28. The 14 ORFs coding for products related to Avr determinants (8 with global homology and 6 with homology restricted to a specific domain) are distributed on both replicons. This finding is notable because to date avr genes have been found by functional assays only in phytopathogenic bacteria with a limited host range (often being confined to members of a single plant species or genus), the host range being molecularly governed by avr genes29. Furthermore, the presence of Avr-related proteins in R. solanacearum is surprising, as no avr-dependent monogenic resistance of solanaceous crops towards bacterial wilt has been reported. It is probable that these Avr proteins confer selective advantages, acting collectively as pathogenicity factors on a large set of host plants30. It is also possible that R. solanacearum possesses some general suppressor(s) of the plant defence response triggered by the recognition of Avr proteins by plant factors.

A second set of 9 proteins dispersed throughout the genome and potentially transiting through the TTSS was identified by homology to determinants in other plant pathogenic bacteria, located in the immediate regions flanking hrp loci. The hrp locus of Pseudomonas syringae, for example, is part of a pathogenicity island (PAI)31 that harbours several Hrp-dependent effector proteins32. A PIP-box consensus motif (TTCGC-N15-TTCGC)33, suggestive of hrp-dependent regulation, is present in the promoter sequence of 6 avr gene homologues (RS02644, RS04524, RS05218, RS05356, RS05373 and RS05468).
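A scan for the PIP-box consensus quoted above (TTCGC-N15-TTCGC) can be sketched with a regular expression. This is a minimal illustration under stated assumptions: it requires an exact match to the consensus with exactly 15 intervening bases, it searches a supplied promoter sequence rather than a whole annotated genome, and the function name is hypothetical.

```python
import re

# Lookahead patterns so that overlapping candidates are all reported.
PIP_FORWARD = re.compile(r"(?=TTCGC[ACGT]{15}TTCGC)")
# Reverse complement of the consensus, to catch boxes on the opposite strand.
PIP_REVERSE = re.compile(r"(?=GCGAA[ACGT]{15}GCGAA)")


def find_pip_boxes(seq):
    """Return sorted (0-based start, strand) positions of exact PIP-box matches."""
    seq = seq.upper()
    hits = [(m.start(), "+") for m in PIP_FORWARD.finditer(seq)]
    hits += [(m.start(), "-") for m in PIP_REVERSE.finditer(seq)]
    return sorted(hits)


# Toy promoter fragment with one forward-strand box starting at index 2.
promoter = "AA" + "TTCGC" + "A" * 15 + "TTCGC" + "GG"
print(find_pip_boxes(promoter))
```

Reported coordinates always refer to the supplied strand. Imperfect boxes with a mismatch in either half-site would be missed by this exact-match pattern.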

As the TTSS-dependent pathogenicity factors are thought to be injected directly into eukaryotic cells to exert their anti-host function, we looked for proteins exhibiting typical eukaryotic features and functions. As a result, we identified potential functional homologues of essential pathogenicity effectors found in bacterial pathogens of animals, such as protein kinases (RS01445 and RS05210) and a probable tyrosine phosphatase (RS03075). Another ORF product (RS04706) contains 3.5 internal repeats structurally related to the PPR motif34—a motif prevalent in plant organellar proteins, but never found to date in prokaryotic proteins. Three gene families encode proteins with ankyrin repeat domains35, a pirin-like domain36 and leucine-rich repeats (LRRs)37. Although these classes of proteins are not strictly restricted to eukaryotes, they appear to be implicated in eukaryotic signalling pathways, presumably through the properties of these specific domains to establish protein–protein interactions. Several LRR-containing proteins have been shown to transit through TTSSs in pathogens such as Yersinia, Shigella or Salmonella sp., and recently in R. solanacearum28,38. We found a total of 10 genes coding for 3 families of LRR-containing proteins scattered in the genome, drawing attention to the probable functional redundancy of these candidate effectors of pathogenicity. Considering the above observations, we estimate that the number of genes encoding potential TTSS-dependent effectors is 40 or higher. This number is greater than that (25) estimated for the bacterial agent of dysentery Shigella flexneri39. Genome-scale studies carried out on other plant-pathogenic bacteria will soon reveal whether this high number of effectors is correlated with the wide host range of R. solanacearum.

Evolution of virulence

Although most of the candidate genes encoding TTSS-dependent effectors are not found near the hrp gene cluster and are distributed evenly on both replicons, approximately half of them appear to reside within clusters of ACURs. Five of these ACUR clusters have the typical features of PAIs such as the presence of DNA sequences indicative of gene mobility (insertion sequences, transposases) or recombination events (Rhs and Vgr elements), and, in some cases, the association of these regions with tRNA or prophage sequences31 (Fig. 2). Moreover, 15 of the approximately 40 candidate TTSS-dependent effectors, identified on the basis of homologies or structural features, have a G+C content that differs significantly from the mean of the R. solanacearum genome (see Supplementary Information Table 2). This suggests that they may have been acquired through horizontal gene transfer. In addition, some observations suggest that ACURs could also contribute to the evolution of new virulence genes. For example, there are three genes related to the host specificity determinant pthG from Erwinia herbicola40 (RS04524, RS05166 and RS05218), located in three different ACURs. These gene products all have a common PthG amino terminus (54% identical residues over the 35 N-terminal amino acids) but with a variable carboxy terminus. Gene duplication and evolution may account for the generation of new virulence specificities in the C termini while maintaining appropriate expression and targeting signals in the N termini. A similar mechanism of duplication followed by differential evolution of domains can also be observed with members of the haemagglutinin-related protein family (such as RS02101 and RS02405) located within or at the border of ACURs. In addition, the deletion of genes could also serve as a means of bacterial adaptation. Evolutionary models of interactions between plants and plant-pathogenic bacteria state that the emergence of a plant resistance gene that recognizes a virulence gene would abolish the value of the corresponding virulence gene41. Possible evidence for this hypothesis may be found in the occurrence of single or multiple frameshifts or insertion sequence insertions in putative virulence gene homologues (RS00660, RS03919 and RS02460). These inactivating mechanisms may be the result of an evolutionary elimination of genes that have become liabilities for the pathogen.

A close investigation of each of the 30-kb regions flanking the hrp locus did not reveal significant changes in the G+C content nor the presence of DNA mobility associated elements typically observed in ACURs. This indicates that, contrary to the P. syringae hrp gene cluster and flanking regions32, the R. solanacearum hrp locus and flanking regions containing virulence genes is not, in the strictest sense, a PAI. Instead, our analysis suggests that this region is composed of a core group of ancestral pathogenicity genes that have been subjected to long coevolution with the R. solanacearum genome. Thus, the evolutionary status of the hrp region is in sharp contrast with the set of candidate effector-encoding genes that are scattered in the genome and often associated with ACURs. This leads us to speculate that R. solanacearum, along with an ancestral TTSS and associated effectors, acquired new effector genes through horizontal transfer, thereby remaining a successful pathogen.
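The kind of comparison described above, checking whether the G+C content of a region departs from the average of the surrounding sequence, can be sketched as a windowed scan. This is an illustration only: the 2,500-bp default mirrors the window size given in the Fig. 1 legend, while the use of non-overlapping windows and the function names are assumptions made for this example.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_deviation_by_window(seq, window=2500):
    """Return (window start, G+C fraction minus sequence-wide mean) per window."""
    mean_gc = gc_fraction(seq)
    deviations = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        deviations.append((start, gc_fraction(chunk) - mean_gc))
    return deviations


# Toy sequence: a G-only block followed by an A-only block.
print(gc_deviation_by_window("GGGGAAAA", window=4))
```

Windows whose deviation is large relative to the spread seen elsewhere in the sequence would be candidates for closer inspection. Deciding what counts as significant requires a statistical criterion that this sketch does not provide.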

Comparative genomics

A comparison of the R. solanacearum proteome with the proteome of bacteria that have been sequenced entirely reveals a close similarity of R. solanacearum with two other large-genome bacteria (Supplementary Information Figs 5 and 6), Pseudomonas aeruginosa42 and Sinorhizobium meliloti43. Like R. solanacearum, these two species interact with eukaryotic hosts and can be free living in soil. In particular, the proteome of R. solanacearum shares many common features with that of the opportunistic pathogen P. aeruginosa: (1) a diversity of nutritional pathways; (2) many membrane transport systems; (3) complex chemosensory systems; and (4) a large number of regulatory genes, with a high proportion of two-component regulatory system proteins (37 sensors, 55 response regulators and 6 sensor-response regulator hybrids in R. solanacearum). These features are consistent with the evolution of adaptative responses to changes in environment, permitting the bacteria to thrive in diverse ecological niches.

We used a similar comparative analysis on R. solanacearum candidate pathogenicity determinants (Supplementary Information Table 3). Only homologies restricted to degradative enzymes, resistance to oxidative stress, haemolysins, peptide synthases and certain attachment structures were found with the plant pathogen X. fastidiosa8. The same functions also seem to be widely conserved among other bacterial proteomes, in contrast to most of the predicted TTSS-dependent effectors. Several TTSS-dependent effectors appear to be conserved in the plant-pathogenic bacteria, as demonstrated by conservation of 13 known effector proteins. However, with two exceptions (AvrRxv/YopP-related proteins4 and possibly RS02872), these effectors are not found in bacterial pathogens of animals. Significantly, most of these TTSS-dependent effectors are not found in the genome of Ralstonia metallidurans, a non-pathogenic bacterium that is taxonomically the closest related species to R. solanacearum. Our results imply that bacterial pathogens of plants and animals, despite extremely high conservation of the apparatus used to secrete effectors, harbour distinct arrays of specialized effectors. According to this hypothesis, TTSS-dependent effectors would have different cellular targets or effects on the respective hosts, resulting in the differential cellular processes observed during infection of animal and plant cells.

Methods

Sequencing and assembly

We generated about 80,000 sequences from both ends of genomic clones ranging from 1.5 to 100 kb. The steps to obtain the final sequence (shotgun assembly, gap closure and polishing) are described in the Supplementary Information. The validity of the sequence was assessed by comparing the restriction enzyme pattern deduced from the sequence to the experimentally observed restriction pattern obtained by digestion of clone DNA of a minimal tiling path composed mainly of bacterial artificial chromosome (BAC) clones with 6-base recognition enzymes. A total of 99.25% of the sequences were validated with at least two different restriction enzymes, 0.49% with only one enzyme, and 0.26% of the sequence was not validated, potentially owing to local mis-assemblies (the largest region is less than 5.1 kb). The location of these regions can be found in the Supplementary Information.
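The restriction-pattern comparison described above can be sketched as an in silico digest followed by a size-by-size comparison with the observed fragments. This is an illustration only, not the validation pipeline used for the genome: the enzymes are not named in the text, so EcoRI (site GAATTC, cutting after the first G) is used purely as an example of a 6-base recognition enzyme, and the function names and the 5% size tolerance are assumptions.

```python
def digest(seq, site, cut_offset, circular=False):
    """Predicted fragment sizes (largest first) after cutting at every site.

    cut_offset is the number of bases from the start of the site to the cut.
    Sites spanning the junction of a circular molecule are not detected.
    """
    seq = seq.upper()
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    if not cuts:
        return [len(seq)]
    if circular:
        # Fragments between consecutive cuts, plus the one that wraps around.
        sizes = [b - a for a, b in zip(cuts, cuts[1:])]
        sizes.append(len(seq) - cuts[-1] + cuts[0])
    else:
        bounds = [0] + cuts + [len(seq)]
        sizes = [b - a for a, b in zip(bounds, bounds[1:])]
    return sorted((s for s in sizes if s > 0), reverse=True)


def patterns_agree(predicted, observed, tolerance=0.05):
    """True if both patterns have the same number of bands of similar size."""
    if len(predicted) != len(observed):
        return False
    pairs = zip(sorted(predicted, reverse=True), sorted(observed, reverse=True))
    return all(abs(p - o) <= tolerance * p for p, o in pairs)


# Toy linear clone insert with two EcoRI sites.
insert = "AAGAATTCTTTTGAATTCCC"
predicted = digest(insert, "GAATTC", cut_offset=1)
print(predicted)
print(patterns_agree(predicted, [10.2, 7.1, 3.0]))
```

Real gel data would also require handling co-migrating bands and fragments too small to resolve, which this sketch does not attempt.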

Gene prediction and annotation

Sequence analysis and annotation were performed using iANT (integrated Annotation Tool)44 as described for S. meliloti45, except that the probabilistic Markov model for coding regions used by the gene prediction software FrameD was constructed on 77 R. solanacearum gene sequences obtained from public databanks. The alternative matrix was built using genes first identified in ACURs based on homology, as revealed by BLASTX analysis. Predicted ORFs were reviewed individually by gene annotators for start codon assignment. The output of Prosite searches and BLASTP analysis on the corresponding products was also reviewed individually by annotators to generate the proposed annotations. Proteins were classified according to Riley's rules46. The complete annotated genetic map, search tools (SRS, BLAST), annotation and process classification are available at http://sequence.toulouse.inra.fr/R.solanacearum.html.

Statistical analysis

We performed statistical analysis using Fisher's exact test47.
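For readers unfamiliar with the test, a two-sided Fisher's exact test on a 2 × 2 contingency table can be sketched as below. This is a generic illustration, not the implementation of ref. 47: the two-sided P value is taken as the sum of the probabilities of all tables that are no more likely than the observed one, which is one common convention, and the function name is hypothetical. For the comparisons in this paper, the rows would be the two replicons (or ACUR versus non-ACUR regions) and the columns the presence or absence of the feature tested.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def weight(x):
        # Number of ways to obtain x in the top-left cell with fixed margins.
        return comb(row1, x) * comb(row2, col1 - x)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = weight(a)
    # Exact integer arithmetic: sum every table no more likely than the observed one.
    extreme = sum(w for w in (weight(x) for x in range(lowest, highest + 1))
                  if w <= observed)
    return extreme / total_tables


# Toy table: 3 of 4 in one group versus 1 of 4 in the other.
print(fisher_exact_two_sided(3, 1, 1, 3))
```

Working with integer counts of tables rather than floating-point probabilities avoids rounding problems when comparing tables of equal likelihood, and Python's arbitrary-precision integers keep this exact even for tables with thousands of genes.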