Comparisons of dN/dS are time dependent for closely related bacterial genomes

https://doi.org/10.1016/j.jtbi.2005.08.037

Abstract

The ratio of non-synonymous (dN) to synonymous (dS) changes between taxa is frequently computed to assay the strength and direction of selection. Here we note that for comparisons between closely related strains and/or species a second parameter needs to be considered, namely the time since divergence of the two sequences under scrutiny. We demonstrate that a simple time lag model provides a general, parsimonious explanation of the extensive variation in the dN/dS ratio seen when comparing closely related bacterial genomes. We explore this model through simulation and comparative genomics, and suggest a role for hitch-hiking in the accumulation of non-synonymous mutations. We also note taxon-specific differences in the change of dN/dS over time, which may indicate variation in selection, or in population genetics parameters such as population size or the rate of recombination. The effect of comparing intra-species polymorphism and inter-species substitution, and the problems associated with these concepts for asexual prokaryotes, are also discussed. We conclude that, because of the critical effect of time since divergence, inter-taxa comparisons are only possible by comparing trajectories of dN/dS over time and it is not valid to compare taxa on the basis of single time points.

Introduction

Comparisons of the relative rates of change at synonymous (silent) and non-synonymous (replacement) sites have for many years been a central tenet of molecular evolution (Kimura, 1991). The relative frequencies of these changes are determined by a complex blend of stochastic and selective forces acting at many different levels, from a single site or codon to a genome or population (metagenome). Estimates for dN (the number of non-synonymous changes per non-synonymous site) and dS (the number of synonymous changes per synonymous site) are typically interpreted in terms of the selective consequences of these changes. Whilst this may be valid for comparisons between relatively diverged eukaryotic species, where most of the non-synonymous changes are fixed (i.e. they are substitutions), for comparisons within populations, or between closely related bacterial genomes where “species” are not easily defined, it is not valid to assume that the selective consequences of non-synonymous change are effectively instantaneous. In such cases, the possibility that the dN/dS ratio might change over time due to a lag in the removal of slightly deleterious mutations must be considered.
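The lag described above can be illustrated with a minimal sketch. It is not the model developed in this paper: it assumes that a fraction `f_neutral` of non-synonymous mutations is neutral, that the remainder are deleterious and are removed at a constant rate `s` per generation, and that mutations arise uniformly over the time since divergence `t`. The function name and all parameter values are hypothetical.

```python
import math


def expected_dn_ds(t, s, f_neutral):
    """Illustrative dN/dS for two sequences that diverged t generations ago.

    Assumptions (not taken from the paper): a fraction f_neutral of
    non-synonymous mutations is neutral, the rest are deleterious and are
    lost at a constant rate s per generation, and mutations arise uniformly
    over the interval [0, t]. Synonymous mutations are treated as neutral.
    """
    if t < 0 or s < 0 or not 0.0 <= f_neutral <= 1.0:
        raise ValueError("need t >= 0, s >= 0 and 0 <= f_neutral <= 1")
    x = s * t
    # Mean survival of deleterious mutations of every age a between 0 and t:
    # (1/t) * integral_0^t exp(-s * a) da = (1 - exp(-s * t)) / (s * t).
    surviving = 1.0 if x == 0 else -math.expm1(-x) / x
    return f_neutral + (1.0 - f_neutral) * surviving


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to show the shape of the curve.
    for t in (0, 10, 100, 1_000, 10_000, 100_000):
        print(t, round(expected_dn_ds(t, s=0.01, f_neutral=0.1), 3))
```

Under these assumptions the ratio starts at 1 for sequences that have only just diverged and declines towards `f_neutral` as selection has more time to act, which is why a single time point says little about the strength of selection.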

Synonymous substitutions are usually regarded as neutral, or at least as having a much smaller effect on fitness than non-synonymous substitutions. The dN/dS ratio resulting from the comparison between two orthologous genes therefore has both theoretical and practical implications as it can reveal the type of selection pressure acting on the genes. A low ratio (dN/dS⪡1) indicates strong purifying (“stabilizing”) selection, whereas a high ratio (dN/dS>1) indicates selection for diversification (“positive selection”). The calculation of dN/dS can therefore help to identify genes or domains under particular biochemical or ecological constraint, or conversely putative virulence factors or candidate vaccine targets subject to diversifying (frequency-dependent) selection from the host immune response (Smith et al., 1995). Although the average dN/dS ratio over a whole coding region is a fairly blunt tool for detecting positive selection and more sensitive approaches focusing on specific codons are becoming increasingly common, these approaches still focus on local variations of the dN/dS ratio (Suzuki et al., 2001; Nielsen and Yang, 1998; Yang and Bielawski, 2000a).

Most studies on the relative rates of change at synonymous and non-synonymous sites have focused on sequences which shared a common ancestor millions of years ago. For highly diverged sequences the problem of multiple substitutions occurring at a single site complicates the calculation of dN and dS and a great deal of effort has been spent in perfecting methods which correct for this problem (Li et al., 1985; Nei and Gojobori, 1986; Yang and Nielsen, 2000b). Far less attention has been paid to much more closely related sequences belonging to the same named bacterial species and differing by, say, <2% of nucleotide sites. Although such sequences are far less likely to have experienced multiple substitutions, these comparisons are not drawn without difficulties. The paucity of nucleotide changes necessitates the use of large amounts of sequence data in order to achieve statistically meaningful results, and the estimates are extremely sensitive to sequencing error. Fortunately, complete genome sequence data are now available for a number of bacterial taxa. These data alleviate the statistical problems and allow closer examination of the relative rates of synonymous and non-synonymous change at a fine evolutionary scale; that is between sequences which shared a common ancestor from perhaps decades to hundreds of thousands of years ago.

Sequencing error, however, remains a critical issue for comparisons between closely related genomes, as each error becomes proportionately more important when the true number of changes is small. Importantly, too, one would expect sequencing errors to tend to make an observed dN/dS ratio approach unity as there need be no bias with respect to codon position. To eliminate this problem as far as possible, we have been highly selective as regards the genomes to employ and have rechecked putative changes by re-sequencing and by re-analysis of the trace files.
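The pull of sequencing error towards unity can be made concrete with a short sketch. It assumes that spurious changes fall on sites at random and that non-synonymous sites are three times as common as synonymous sites (the approximation used later in this text); the counts in the example are hypothetical and the function names are illustrative only.

```python
def dn_ds(nonsyn, syn, site_ratio=3.0):
    """dN/dS from change counts.

    site_ratio is the assumed number of non-synonymous sites per
    synonymous site (3.0 is an approximation, not a measured value).
    """
    if syn <= 0:
        raise ValueError("need at least one synonymous change")
    return (nonsyn / site_ratio) / syn


def observed_dn_ds(true_nonsyn, true_syn, errors, site_ratio=3.0):
    """Ratio observed when `errors` spurious changes land on sites at random.

    With no bias in codon position, errors hit non-synonymous sites in
    proportion to their share of all sites, so the errors alone have dN/dS = 1.
    """
    frac_nonsyn = site_ratio / (site_ratio + 1.0)
    return dn_ds(
        true_nonsyn + errors * frac_nonsyn,
        true_syn + errors * (1.0 - frac_nonsyn),
        site_ratio,
    )


if __name__ == "__main__":
    # Hypothetical genome pair: 30 true non-synonymous and 70 true synonymous changes.
    for errors in (0, 20, 100, 1000):
        print(errors, round(observed_dn_ds(30, 70, errors), 3))
```

The fewer true changes there are, the larger the share of the observed ratio that is contributed by errors, which is why error checking matters most for the most closely related genomes.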

There are reasons to be confident that sequencing errors will not generally compromise this analysis. In their comparison of the two genomes of Mycobacterium tuberculosis, Gutacker et al. (2002) checked ∼300 synonymous changes and reported that 91% of these changes were accurate. Qualitatively similar results were obtained for the analysis of closely related genome sequences of Bacillus anthracis (Read et al., 2002) and M. tuberculosis (Fleischmann et al., 2002). We have even more confidence in the accuracy of other genome sequences. We checked 285 unique single nucleotide changes within the MSSA476 genome of Staphylococcus aureus and found them to be 100% accurate, and a sample of 30 other unique single nucleotide changes in other S. aureus genomes was re-sequenced and also found to be 100% correct. Furthermore, statistical analysis of the quality of reads in assembled genome sequence projects at the Sanger Centre, UK, also points towards low error rates. In the case of the S. aureus MSSA476 genome sequence, the predicted sequence error rate as calculated from the consensus confidence of all bases in the assembled sequence is 1 in 7,807,752 (i.e. 0.37 bases per genome) (MTGH unpublished data).

The dN/dS ratio of bacterial genes differing at 1–2% of nucleotides, and assumed to be under stabilizing selection, generally falls within the range of 0.04–0.2 (Feil et al., 2003; Dingle et al., 2001; Jolley et al., 2000; Jones et al., 2003; Meats et al., 2003). However, in cases where sequences are even more closely related, a relative preponderance of non-synonymous change is noted. Among the previously cited works, Gutacker et al. (2002) identified only ∼900 single base changes within coding regions when they compared two genomes of the very uniform species M. tuberculosis, corresponding to ∼2.5 changes per 10,000 sites. The authors noted that 65% of these single base changes were non-synonymous, corresponding to a dN/dS ratio of ∼0.6. Holden et al. presented an even more extreme example in their comparison of two genomes of S. aureus (Holden et al., 2004). These genomes correspond to the same Sequence Type (ST) by Multilocus Sequence Typing (MLST) (Maiden et al., 1998; Holden et al., 2004) and only differ by 285 single base changes within coding regions, corresponding to ∼1 change per 10,000 sites. Approximately 70% of the changes were non-synonymous and, as non-synonymous sites are three times more common than synonymous sites, the dN/dS ratio is therefore approaching parity (∼0.8).
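The arithmetic behind these two ratios can be checked with a few lines. The sketch assumes, as the text does, that non-synonymous sites are three times as common as synonymous sites; the function name is illustrative, and no correction for multiple substitutions is applied, which is reasonable only because the sequences compared are so similar.

```python
def dn_ds_from_fraction(frac_nonsyn, site_ratio=3.0):
    """Approximate dN/dS from the fraction of changes that are non-synonymous.

    site_ratio is the assumed number of non-synonymous sites per synonymous
    site. No multiple-hit correction is applied.
    """
    if not 0.0 <= frac_nonsyn < 1.0:
        raise ValueError("frac_nonsyn must be in [0, 1)")
    # Ratio of change counts, then normalised by the ratio of site counts.
    return (frac_nonsyn / (1.0 - frac_nonsyn)) / site_ratio


if __name__ == "__main__":
    print(round(dn_ds_from_fraction(0.65), 2))  # M. tuberculosis comparison, 65% non-synonymous
    print(round(dn_ds_from_fraction(0.70), 2))  # S. aureus comparison, ~70% non-synonymous
```

With a 3:1 site ratio, 75% non-synonymous changes corresponds to dN/dS = 1, so values of 65% and 70% give ratios of roughly 0.6 and 0.8 respectively.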

A surprisingly high dN/dS ratio between very similar sequences has been noted in other taxa. In their comparative genomics analysis of four bacterial species, King Jordan et al. noted that the average dN/dS was unusually high between two very closely related genomes of Chlamydia pneumoniae (Jordan et al., 2002). Read et al. (2003) compared the genome sequence of B. anthracis isolated from a victim of the 2001 anthrax attack in Florida, USA, with the sequence from a reference strain. These authors also noted that the few single nucleotide changes detected tended to be non-synonymous. The same effect can be observed using multi-locus sequence data from large population samples rather than pan-genome comparisons of a limited number of strains. Feil et al. analysed MLST data for 334 strains of S. aureus and noted a higher frequency of non-synonymous change between closely related haplotypes than between more diverged strains (Feil et al., 2003). Finally, Baker et al. (2004) generated MLST data for 316 isolates of M. tuberculosis, and noted a similar dN/dS ratio to the genome-wide analysis by Gutacker et al. (2002).

The studies of closely related bacterial strains are crucial to understanding the basic principles of genetic diversification and niche adaptation, particularly for pathogenic bacteria (Hacker et al., 2003; Feil, 2004). The genomics era has greatly invigorated efforts to establish the genetic basis of clinically relevant phenotypes, most notably antibiotic resistance or heightened virulence, although the emphasis has tended to be on overall differences in gene content rather than the more subtle process of allelic diversification (Feldgarden et al., 2003). Multilocus sequence analyses are also leading to hypotheses concerning the effect of the selective landscape on the emergence of adaptive clones, hitch-hiking and the significance of homologous recombination (Feil, 2004).

Various interpretations of high dN/dS ratios among closely related strains have been offered, including statistical artefact (Jordan et al., 2002), relaxed (Read et al., 2002) or positive (Baker et al., 2004) selection and recent ancestry (Gutacker et al., 2002; Feil et al., 2003). However, no study has examined the intrinsic dependency of the dN/dS ratio on time since divergence. Here we explore this issue through the use of a model and simulation where the influence of population genetics parameters is made explicit. These results are then substantiated by comparative genomic analyses using a range of genetic distances, from practically identical to the degree of divergence over all gene loci that typically encompasses a named bacterial species (∼2%). We also consider the effects of hitch-hiking on the trajectory of this change over evolutionary time.

Section snippets

Data

The sequence and annotation data from complete bacterial genomes were taken from GenBank Genomes and included 7 genomes from the complex Escherichia coli+Shigella flexneri, 7 genomes from the complex Bacillus cereus+anthracis+thuringiensis, 4 genomes of C. pneumoniae, 5 genomes of Streptococcus pyogenes, 6 genomes of S. aureus and 3 genomes of the complex M. tuberculosis+bovis. The complete list of genomes and their accession numbers is available as supplementary material (Table S1). Some genomes were

Exploring the model and simulation

The model and simulation differ in one important aspect: whilst the simulation incorporates genetic drift (by re-sampling from the population with replacement at every generation), the model does not incorporate drift but instead assumes a very large population size, in which case the effect of drift is negligible. Naturally, in such a population a synonymous mutation will not reach fixation but will remain at its initial frequency. The synonymous divergence between two randomly chosen genomes
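The contrast between a drift-free model and re-sampling with replacement can be sketched as follows. This is a minimal illustration and not the authors' simulation: it assumes a haploid Wright-Fisher population of fixed size that carries a single mutation with selection coefficient `s` (with `s = 0` for a neutral, synonymous change). All names and parameter values are hypothetical.

```python
import random


def expected_frequency(x, s):
    """Deterministic (very large population) frequency after one generation.

    x is the current frequency of the mutation and 0 <= s < 1 its selective
    disadvantage. With s = 0 the frequency is unchanged.
    """
    if not 0.0 <= x <= 1.0 or not 0.0 <= s < 1.0:
        raise ValueError("need 0 <= x <= 1 and 0 <= s < 1")
    weight = x * (1.0 - s)
    return weight / (weight + (1.0 - x))


def next_generation(count, pop_size, s, rng):
    """Drift: draw pop_size offspring with replacement from the parents."""
    p = expected_frequency(count / pop_size, s)
    return sum(1 for _ in range(pop_size) if rng.random() < p)


def trajectory(count, pop_size, s, generations, seed=0):
    """Frequency of the mutation in each generation of one simulated run."""
    rng = random.Random(seed)
    freqs = [count / pop_size]
    for _ in range(generations):
        count = next_generation(count, pop_size, s, rng)
        freqs.append(count / pop_size)
    return freqs


if __name__ == "__main__":
    # A neutral change drifts, whereas a deleterious change is purged with a lag.
    print(trajectory(100, 1000, s=0.0, generations=20)[-1])
    print(trajectory(100, 1000, s=0.05, generations=200)[-1])
```

In the deterministic function a neutral mutation keeps its initial frequency indefinitely, as stated above for a very large population, whereas in the re-sampled population its frequency wanders and it is eventually lost or fixed.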

Discussion

Although the dN/dS ratio between orthologous genes is often treated as though it were constant with time, a number of authors have recently noted this ratio to be surprisingly high when very closely related sequences are compared. This has typically either been dismissed as a statistical artefact (Jordan et al., 2002), or interpreted as evidence of positive or relaxed selection (Baker et al., 2004; Read et al., 2002). Here we have provided consistent evidence, both through the simulation of a

Why does dN/dS against time vary between taxa?

It is clear that differences between taxa in terms of the dN/dS ratio may have no selective relevance but simply reflect time since divergence (Ho et al., 2005). A similar time dependency has been observed previously for viral populations (Holmes, 2003; Sharp et al., 2001). However, differences between taxa in the relationship between dN/dS and time, whilst illustrating the power of combining comparisons from multiple taxa, require additional explanation. Figs. 4 and 5 reveal very similar

Concluding remarks

Although the use of intergenic similarity probably provides a more reliable estimate of t than changes in gene content, it is not valid to assume that changes in intergenic regions are without selective consequence, as these sequences will contain gene promoters and other regulatory elements. It is therefore not possible at present to directly compare the rate of change of dN/dS against that expected from the model, although in principle such a comparison could provide a powerful means to

Acknowledgements

We are grateful to Nick Britton and Richard Ward for their mathematical expertise, and to Adam Eyre-Walker for constructive criticisms on the manuscript. This work was greatly facilitated by a Royal Society short-term overseas visit grant awarded to EPCR and EJF. JEC and EJF are funded by an MRC Career Development Award.

Historical note: This paper began with a series of discussions between John Maynard Smith, Noel Smith and myself in early 2003. Whilst John and Noel were musing over the

References (45)

  • M. Feldgarden et al. Gradual evolution in bacteria: evidence from Bacillus systematics. Microbiology (2003)
  • J. Felsenstein. Phylogenies and the comparative method. Am. Nat. (1985)
  • R.D. Fleischmann et al. Whole-genome comparison of Mycobacterium tuberculosis clinical and laboratory strains. J. Bacteriol. (2002)
  • R. Friedman et al. Genome-wide patterns of nucleotide substitution reveal stringent functional constraints on the protein sequences of thermophiles. Genetics (2004)
  • M.M. Gutacker et al. Genome-wide analysis of synonymous single nucleotide polymorphisms in Mycobacterium tuberculosis complex organisms: resolution of genetic relationships among closely related microbial strains. Genetics (2002)
  • J. Hacker et al. Prokaryotic chromosomes and disease. Science (2003)
  • D.L. Hartl et al. Selection intensity for codon bias. Genetics (1994)
  • M. Hasegawa et al. Preponderance of slightly deleterious polymorphism in mitochondrial DNA: nonsynonymous/synonymous rate ratio is much higher within species than between species. Mol. Biol. Evol. (1998)
  • S.Y. Ho et al. Time dependency of molecular rate estimates and systematic overestimation of recent divergence times. Mol. Biol. Evol. (2005)
  • M.T. Holden et al. Complete genomes of two clinical Staphylococcus aureus strains: evidence for the rapid evolution of virulence and drug resistance. Proc. Natl Acad. Sci. USA (2004)
  • E.C. Holmes. Patterns of intra- and interhost nonsynonymous variation reveal strong purifying selection in dengue virus. J. Virol. (2003)
  • K.A. Jolley et al. Carried meningococci in the Czech Republic: a diverse recombining population. J. Clin. Microbiol. (2000)