Comparisons of dN/dS are time dependent for closely related bacterial genomes
Introduction
Comparisons of the relative rates of change at synonymous (silent) and non-synonymous (replacement) sites have for many years been a central tenet of molecular evolution (Kimura, 1991). The relative frequencies of these changes are determined by a complex blend of stochastic and selective forces acting at many different levels, from a single site or codon to a whole genome or population (metagenome). Estimates for dN (the number of non-synonymous changes per non-synonymous site) and dS (the number of synonymous changes per synonymous site) are typically interpreted in terms of the selective consequences of these changes. Whilst this may be valid for comparisons between relatively diverged eukaryotic species, where most of the non-synonymous changes are fixed (i.e. they are substitutions), for comparisons within populations, or between closely related bacterial genomes where “species” are not easily defined, it is not valid to assume that the selective consequences of non-synonymous change are effectively instantaneous. In such cases, the possibility that the dN/dS ratio might change over time due to a lag in the removal of slightly deleterious mutations must be considered.
Synonymous substitutions are usually regarded as neutral, or at least as having a much smaller effect on fitness than non-synonymous substitutions. The dN/dS ratio resulting from the comparison between two orthologous genes therefore has both theoretical and practical implications as it can reveal the type of selection pressure acting on the genes. A low ratio (dN/dS⪡1) indicates strong purifying (“stabilizing”) selection, whereas a high ratio (dN/dS>1) indicates selection for diversification (“positive selection”). The calculation of dN/dS can therefore help to identify genes or domains under particular biochemical or ecological constraint, or conversely putative virulence factors or candidate vaccine targets subject to diversifying (frequency-dependent) selection from the host immune response (Smith et al., 1995). Although the average dN/dS ratio over a whole coding region is a fairly blunt tool for detecting positive selection and more sensitive approaches focusing on specific codons are becoming increasingly common, these approaches still focus on local variations of the dN/dS ratio (Suzuki et al., 2001; Nielsen and Yang, 1998; Yang and Bielawski, 2000a).
Most studies on the relative rates of change at synonymous and non-synonymous sites have focused on sequences which shared a common ancestor millions of years ago. For highly diverged sequences the problem of multiple substitutions occurring at a single site complicates the calculation of dN and dS and a great deal of effort has been spent in perfecting methods which correct for this problem (Li et al., 1985; Nei and Gojobori, 1986; Yang and Nielsen, 2000b). Far less attention has been paid to much more closely related sequences belonging to the same named bacterial species and differing by, say, <2% of nucleotide sites. Although such sequences are far less likely to have experienced multiple substitutions, these comparisons are not drawn without difficulties. The paucity of nucleotide changes necessitates the use of large amounts of sequence data in order to achieve statistically meaningful results, and the estimates are extremely sensitive to sequencing error. Fortunately, complete genome sequence data are now available for a number of bacterial taxa. These data alleviate the statistical problems and allow closer examination of the relative rates of synonymous and non-synonymous change at a fine evolutionary scale; that is between sequences which shared a common ancestor from perhaps decades to hundreds of thousands of years ago.
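To illustrate why multiple substitutions matter little at the divergence levels considered here, the following minimal sketch applies the Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3), to an observed proportion p of differing sites. The Jukes-Cantor formula is used only as a simple, standard example of a multiple-hit correction; the function name and the chosen values of p are illustrative assumptions and are not taken from the analyses reported in this paper.

```python
import math

def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed proportion p of differing sites.

    Illustrative helper (not from the paper): d = -(3/4) * ln(1 - 4p/3).
    """
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Hypothetical divergence levels: near-identical genomes, the ~2% level
# discussed in the text, and a highly diverged pair for contrast.
for p in (0.0001, 0.02, 0.30):
    d = jukes_cantor(p)
    print(f"observed p = {p:.4f}  corrected d = {d:.5f}  inflation = {d / p - 1:.2%}")
```

At p = 0.02 the corrected distance exceeds the observed proportion by only about 1%, whereas at p = 0.30 the correction is substantial, which is consistent with the statement that closely related sequences are far less likely to have experienced multiple substitutions.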
Sequencing error, however, remains a critical issue for comparisons between closely related genomes, as each error becomes proportionately more important when the true number of changes is small. Importantly, too, one would expect sequencing errors to tend to make an observed dN/dS ratio approach unity as there need be no bias with respect to codon position. To eliminate this problem as far as possible, we have been highly selective as regards the genomes to employ and have rechecked putative changes by re-sequencing and by re-analysis of the trace files.
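The expectation that sequencing errors push an observed dN/dS ratio towards unity can be illustrated with a short sketch. It assumes that errors fall on sites at random, and that non-synonymous sites are three times as common as synonymous sites (the approximate ratio used elsewhere in this paper); the function name and all counts are hypothetical and chosen only for illustration.

```python
def observed_dn_ds(true_n, true_s, errors, site_ratio=3.0):
    """Approximate dN/dS after adding randomly placed sequencing errors.

    true_n, true_s : true numbers of non-synonymous and synonymous changes
    errors         : number of spurious changes (assumed to hit sites at random)
    site_ratio     : non-synonymous sites per synonymous site (assumed 3:1)
    """
    frac_n = site_ratio / (site_ratio + 1.0)   # share of errors landing on N sites
    n = true_n + errors * frac_n
    s = true_s + errors * (1.0 - frac_n)
    # Dividing the N/S count ratio by the site ratio gives a crude dN/dS
    return (n / s) / site_ratio

# Hypothetical example: 30 true non-synonymous and 100 true synonymous changes
for errors in (0, 10, 40, 400):
    print(errors, round(observed_dn_ds(30, 100, errors), 3))
```

With no errors the hypothetical ratio is 0.1; as the number of errors grows relative to the number of true changes, the observed ratio rises towards 1, because the errors alone carry a ratio of exactly 1 under the stated assumptions.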
There are reasons to be confident that sequencing errors will not generally compromise this analysis. In their comparison of the two genomes of Mycobacterium tuberculosis, Gutacker et al. (2002) checked ∼300 synonymous changes and reported that 91% of these changes were accurate. Qualitatively similar results were obtained for the analysis of close genome sequences of Bacillus anthracis (Read et al., 2002) and M. tuberculosis (Fleischmann et al., 2002). We have even more confidence in the accuracy of other genome sequences. We checked 285 unique single nucleotide changes within the MSSA476 genome of Staphylococcus aureus and found them to be 100% accurate, and a sample of 30 other unique single nucleotide changes in other S. aureus genomes was re-sequenced and also found to be 100% correct. Furthermore, statistical analysis of the quality of reads in assembled genome sequence projects at the Sanger Centre, UK, also points towards low error rates. In the case of the S. aureus MSSA476 genome sequence, the predicted sequence error rate as calculated from the consensus confidence of all bases in the assembled sequence is 1 in 7,807,752 (i.e. 0.37 bases per genome) (MTGH unpublished data).
The dN/dS ratio of bacterial genes differing at 1–2% of nucleotides, and assumed to be under stabilizing selection, generally falls within the range of 0.04–0.2 (Feil et al., 2003; Dingle et al., 2001; Jolley et al., 2000; Jones et al., 2003; Meats et al., 2003). However, in cases where sequences are even more closely related, a relative preponderance of non-synonymous change is noted. Among the previously cited works, Gutacker et al. (2002) identified only ∼900 single base changes within coding regions when they compared two genomes of the very uniform species M. tuberculosis, corresponding to ∼2.5 changes per 10,000 sites. The authors noted that 65% of these single base changes were non-synonymous, corresponding to a dN/dS ratio of ∼0.6. Holden et al. presented an even more extreme example in their comparison of two genomes of S. aureus (Holden et al., 2004). These genomes correspond to the same Sequence Type (ST) by Multilocus Sequence Typing (MLST) (Maiden et al., 1998; Holden et al., 2004) and only differ by 285 single base changes within coding regions, corresponding to ∼1 change per 10,000 sites. Approximately 70% of the changes were non-synonymous and, as non-synonymous sites are three times more common than synonymous sites, the dN/dS ratio is therefore approaching parity (∼0.8).
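The arithmetic behind the two approximate ratios quoted above can be reproduced with a short sketch. It assumes, as stated in the text, that non-synonymous sites are three times as common as synonymous sites, and it ignores multiple hits, which should be negligible at this level of divergence; the function name is illustrative only, and this shortcut is not a substitute for a full codon-based estimate.

```python
def dn_ds_from_fraction(frac_nonsyn, site_ratio=3.0):
    """Crude dN/dS from the fraction of observed changes that are non-synonymous.

    frac_nonsyn : fraction of single base changes that are non-synonymous
    site_ratio  : non-synonymous sites per synonymous site (3:1, as in the text)
    """
    if not 0 <= frac_nonsyn < 1:
        raise ValueError("frac_nonsyn must lie in [0, 1)")
    count_ratio = frac_nonsyn / (1.0 - frac_nonsyn)   # N changes per S change
    return count_ratio / site_ratio

# Fractions reported in the text for the two genome comparisons
print(f"M. tuberculosis (65% non-synonymous): {dn_ds_from_fraction(0.65):.2f}")  # 0.62
print(f"S. aureus (70% non-synonymous):       {dn_ds_from_fraction(0.70):.2f}")  # 0.78
```

Under these assumptions, a dN/dS of exactly 1 would correspond to 75% of changes being non-synonymous, which is why 65–70% already places the ratio close to parity.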
A surprisingly high dN/dS ratio between very similar sequences has been noted in other taxa. In their comparative genomics analysis of four bacterial species, King Jordan et al. noted that the average dN/dS was unusually high between two very closely related genomes of Chlamydia pneumoniae (Jordan et al., 2002). Read et al. (2003) compared the genome sequence of B. anthracis isolated from a victim of the 2001 anthrax attack in Florida, USA, with the sequence from a reference strain. These authors also noted that the few single nucleotide changes detected tended to be non-synonymous. The same effect can be observed using multi-locus sequence data from large population samples rather than pan-genome comparisons of a limited number of strains. Feil et al. analysed MLST data for 334 strains of S. aureus and noted a higher frequency of non-synonymous change between closely related haplotypes than between more diverged strains (Feil et al., 2003). Finally, Baker et al. (2004) generated MLST data for 316 isolates of M. tuberculosis, and noted a similar dN/dS ratio to the genome-wide analysis by Gutacker et al. (2002).
The studies of closely related bacterial strains are crucial to understanding the basic principles of genetic diversification and niche adaptation, particularly for pathogenic bacteria (Hacker et al., 2003; Feil, 2004). The genomics era has greatly invigorated efforts to establish the genetic basis of clinically relevant phenotypes, most notably antibiotic resistance or heightened virulence, although the emphasis has tended to be on overall differences in gene content rather than the more subtle process of allelic diversification (Feldgarden et al., 2003). Multilocus sequence analyses are also leading to hypotheses concerning the effect of the selective landscape on the emergence of adaptive clones, hitch-hiking and the significance of homologous recombination (Feil, 2004).
Various interpretations of high dN/dS ratios among closely related strains have been offered, including statistical artefact (Jordan et al., 2002), relaxed (Read et al., 2002) or positive (Baker et al., 2004) selection and recent ancestry (Gutacker et al., 2002; Feil et al., 2003). However, no study has examined the intrinsic dependency of the dN/dS ratio on time since divergence. Here we explore this issue through the use of a model and simulation where the influence of population genetics parameters is made explicit. These results are then substantiated by comparative genomic analyses using a range of genetic distances, from practically identical to the degree of divergence over all gene loci which will typically encompass a named bacterial species (∼2%). We also consider the effects of hitch-hiking on the trajectory of this change over evolutionary time.
Data
The sequence and annotation data from complete bacterial genomes were taken from GenBank Genomes and included 7 genomes from the complex Escherichia coli+Shigella flexneri, 7 genomes from the complex Bacillus cereus+anthracis+thuringiensis, 4 genomes of C. pneumoniae, 5 genomes of Streptococcus pyogenes, 6 genomes of S. aureus and 3 genomes of the complex M. tuberculosis+bovis. The complete list of genomes and their accession numbers is available as supplementary material (Table S1). Some genomes were
Exploring the model and simulation
The model and simulation differ in one important aspect: whilst the simulation incorporates genetic drift (by re-sampling from the population with replacement at every generation), the model does not incorporate drift but instead assumes a very large population size, in which case the effect of drift is negligible. Naturally, in such a population a synonymous mutation will not reach fixation but will remain at its initial frequency. The synonymous divergence between two randomly chosen genomes
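The distinction between drift and its absence can be sketched for a single neutral (synonymous) mutation as follows. This is a minimal Wright-Fisher style illustration written for this purpose, not the simulation used in this study: the function names, the population size, the initial frequency and the number of generations are all hypothetical, and selection and new mutation are omitted.

```python
import random

def wright_fisher_neutral(pop_size, init_freq, generations, seed=0):
    """Frequency trajectory of a neutral mutation under drift.

    Each generation, pop_size genomes are re-sampled with replacement
    from the current population (hypothetical parameters throughout).
    """
    rng = random.Random(seed)
    count = round(pop_size * init_freq)
    trajectory = [count / pop_size]
    for _generation in range(generations):
        freq = count / pop_size
        # Each offspring carries the mutation with probability equal to
        # its current frequency (sampling with replacement).
        count = sum(1 for _draw in range(pop_size) if rng.random() < freq)
        trajectory.append(count / pop_size)
    return trajectory

def deterministic_neutral(init_freq, generations):
    """Very large population without drift: a neutral frequency stays constant."""
    return [init_freq] * (generations + 1)

drift = wright_fisher_neutral(pop_size=50, init_freq=0.1, generations=200, seed=1)
no_drift = deterministic_neutral(init_freq=0.1, generations=200)
print("final frequency with drift:   ", drift[-1])
print("final frequency without drift:", no_drift[-1])
```

In the finite population the mutation will eventually be lost or fixed by chance, whereas in the drift-free case it remains at its initial frequency indefinitely, as described above.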
Discussion
Although the dN/dS ratio between orthologous genes is often treated as though it were constant with time, a number of authors have recently noted this ratio to be surprisingly high when very closely related sequences are compared. This has typically either been dismissed as a statistical artefact (Jordan et al., 2002), or interpreted as evidence of positive or relaxed selection (Baker et al., 2004; Read et al., 2002). Here we have provided consistent evidence, both through the simulation of a
Why does dN/dS against time vary between taxa?
It is clear that differences between taxa in terms of the dN/dS ratio may have no selective relevance but simply reflect time since divergence (Ho et al., 2005). A similar time dependency has been observed previously for viral populations (Holmes, 2003; Sharp et al., 2001). However, differences between taxa in the relationship between dN/dS and time, whilst illustrating the power of combining comparisons from multiple taxa, require additional explanation. Figs. 4 and 5 reveal very similar
Concluding remarks
Although the use of intergenic similarity probably provides a more reliable estimate of than changes in gene content, it is not valid to assume that changes in intergenic regions are without selective consequence, as these sequences will contain gene promoters and other regulatory elements. It is therefore not possible at present to directly compare the rate of change of dN/dS against that expected from the model, although in principle such a comparison could provide a powerful means to
Acknowledgements
We are grateful to Nick Britton and Richard Ward for their mathematical expertise, and to Adam Eyre-Walker for constructive criticisms on the manuscript. This work was greatly facilitated by a Royal Society short-term overseas visit grant awarded to EPCR and EJF. JEC and EJF are funded by an MRC Career Development Award.
Historical note: This paper began with a series of discussions between John Maynard Smith, Noel Smith and myself in early 2003. Whilst John and Noel were musing over the
References (45)
Microevolutionary genomics of bacteria. Theor. Popul. Biol. (2002)
The altered evolutionary trajectories of gene duplicates. Trends Genet. (2004)
DNA repeats lead to the accelerated loss of gene order in bacteria. Trends Genet. (2003)
Statistical methods for detecting molecular adaptation. Trends Ecol. Evol. (2000)
Silent nucleotide polymorphisms and a phylogeny for Mycobacterium tuberculosis. Emerg. Infect. Dis. (2004)
Heterogeneity of genome sizes among natural isolates of Escherichia coli. J. Bacteriol. (1995)
What are bacterial species? Annu. Rev. Microbiol. (2002)
Multilocus sequence typing system for Campylobacter jejuni. J. Clin. Microbiol. (2001)
Small change: keeping pace with microevolution. Nat. Rev. Microbiol. (2004)
How clonal is Staphylococcus aureus? J. Bacteriol. (2003)
Gradual evolution in bacteria: evidence from Bacillus systematics. Microbiology
Phylogenies and the comparative method. Am. Nat.
Whole-genome comparison of Mycobacterium tuberculosis clinical and laboratory strains. J. Bacteriol.
Genome-wide patterns of nucleotide substitution reveal stringent functional constraints on the protein sequences of thermophiles. Genetics
Genome-wide analysis of synonymous single nucleotide polymorphisms in Mycobacterium tuberculosis complex organisms: resolution of genetic relationships among closely related microbial strains. Genetics
Prokaryotic chromosomes and disease. Science
Selection intensity for codon bias. Genetics
Preponderance of slightly deleterious polymorphism in mitochondrial DNA: nonsynonymous/synonymous rate ratio is much higher within species than between species. Mol. Biol. Evol.
Time dependency of molecular rate estimates and systematic overestimation of recent divergence times. Mol. Biol. Evol.
Complete genomes of two clinical Staphylococcus aureus strains: evidence for the rapid evolution of virulence and drug resistance. Proc. Natl Acad. Sci. USA
Patterns of intra- and interhost nonsynonymous variation reveal strong purifying selection in dengue virus. J. Virol.
Carried meningococci in the Czech Republic: a diverse recombining population. J. Clin. Microbiol.
† Deceased.