Asymmetric substitution patterns: a review of possible underlying mutational or selective mechanisms
Introduction
Chargaff (1950) experimentally determined that A=T and G=C in equimolar frequencies when both DNA strands are analysed together. Three years later, Watson and Crick (1953) determined the secondary structure of DNA and stated the base-pairing rules that explain these frequencies. More surprisingly, the equalities are still observed within each strand (Lin and Chargaff, 1967). Under no-strand-bias conditions, when mutation and selection have equal effects on both strands, there are six possible substitution rates instead of 12, as stated by parity rule type 1, PR1 (Sueoka, 1995). The rationale is as follows: since substitution rates are scored on one strand, a change such as T to C in a given strand results from either a T→C substitution on that strand or an A→G substitution on the complementary strand. The type-2 parity rule, PR2, can be formally derived from PR1 (Lobry, 1995) to give the base frequencies within each strand at equilibrium: A=T and G=C. Moreover, convergence to PR2 is also expected when the substitution rates are not constant over time (Lobry and Lobry, 1999). Any deviation from PR2 implies asymmetric substitution: the result of different mutation rates, different selective pressures, or both, between the two strands of DNA. There are two principal ways of studying asymmetric substitution: phylogenetic reconstruction of base substitutions and detection of deviations from PR2.
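The step from PR1 to PR2 can be illustrated numerically. The following is a minimal sketch, not the formal derivation of Lobry (1995): it assumes a simple discrete-time substitution model, the rate values are arbitrary illustrations, and the names `pr1_matrix` and `evolve` are hypothetical. It expands six rates into twelve under PR1, iterates the base frequencies of one strand from a skewed starting composition, and then repeats the run with one rate raised on the scored strand only.

```python
COMP = {"A": "T", "T": "A", "G": "C", "C": "G"}
BASES = "ACGT"


def pr1_matrix(six_rates):
    """Expand six rates into 12 per-step substitution probabilities,
    giving each change and its complementary change the same value (PR1)."""
    rates = {}
    for (x, y), r in six_rates.items():
        rates[(x, y)] = r
        rates[(COMP[x], COMP[y])] = r
    return rates


def evolve(freqs, rates, steps):
    """Iterate the base frequencies of one strand for a number of steps."""
    for _ in range(steps):
        new = {}
        for y in BASES:
            # Probability that base y is left unchanged in one step.
            stay = 1.0 - sum(rates[(y, z)] for z in BASES if z != y)
            gain = sum(freqs[x] * rates[(x, y)] for x in BASES if x != y)
            new[y] = freqs[y] * stay + gain
        freqs = new
    return freqs


# Arbitrary illustrative rates (assumption, not taken from any data set).
six = {("A", "T"): 0.001, ("A", "G"): 0.004, ("A", "C"): 0.002,
       ("G", "C"): 0.001, ("G", "A"): 0.006, ("G", "T"): 0.003}
rates = pr1_matrix(six)
start = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}

# Under PR1 the strand converges to A=T and G=C (PR2).
final = evolve(start, rates, 20000)

# Hypothetical strand bias: C->T raised on the scored strand only,
# so that C->T and its complement G->A no longer have equal rates.
biased = dict(rates)
biased[("C", "T")] *= 2
skewed = evolve(start, biased, 20000)

print({b: round(final[b], 4) for b in BASES})
print({b: round(skewed[b], 4) for b in BASES})
```

Under these assumptions the first run ends with A≈T and G≈C despite the A-rich starting composition, whereas the second run settles at frequencies that deviate from PR2.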
In the first method, asymmetries are detected by aligning homologous sequences, estimating the substitution matrix and comparing the frequencies of complementary changes. Wu and Maeda (1987) used this method to test for asymmetric substitution in a region of the β-globin complex of primates. They did detect asymmetries, but since the origins and termini of replication for their data were not known, their results are not reliable, as shown by Bulmer (1991), who re-examined an adjacent region of the human β-globin complex. Francino et al. (1996) used the same method to search for asymmetric substitution in eubacteria. They found a difference between the complementary changes C→T and G→A when scoring substitutions on the coding strand. The advantage of this method is that it directly detects the number and type of substitutions, but access to suitable data sets is limited because of the difficulty of finding orthologous sequences with an adequate divergence time.
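The comparison of complementary changes can be sketched as follows. This is a simplified illustration rather than the procedure of any of the cited studies: it assumes the ancestral sequence is already known and aligned (real analyses infer it from a phylogeny), it compares raw counts without normalising by base frequencies, and the function names and toy sequences are hypothetical.

```python
from collections import Counter

COMP = {"A": "T", "T": "A", "G": "C", "C": "G"}


def count_substitutions(ancestor, descendant):
    """Count each of the 12 substitution types along an aligned pair.
    Gaps and ambiguous symbols are skipped."""
    counts = Counter()
    for x, y in zip(ancestor.upper(), descendant.upper()):
        if x != y and x in COMP and y in COMP:
            counts[(x, y)] += 1
    return counts


def complementary_pairs(counts):
    """Pair every change with its complementary change (six pairs)."""
    pairs, seen = [], set()
    for x in "ACGT":
        for y in "ACGT":
            if x == y or (x, y) in seen:
                continue
            cx, cy = COMP[x], COMP[y]
            seen.update({(x, y), (cx, cy)})
            pairs.append(((x, y), counts[(x, y)], (cx, cy), counts[(cx, cy)]))
    return pairs


# Toy alignment (invented for illustration): three C->T changes, one G->A.
ancestor = "CCCCGGGGAATT"
descendant = "TTTCGGGAAATT"
counts = count_substitutions(ancestor, descendant)
for change, n, comp_change, m in complementary_pairs(counts):
    print(change, n, comp_change, m)
```

With no strand bias, the two counts in each pair would be expected to be similar once normalised by the number of sites carrying the ancestral base; a consistent excess of one member of a pair would point to asymmetric substitution.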
The second method (Lobry, 1996a) analyses DNA sequences for deviations from A=T and G=C frequencies. As early as 1990, such deviations in SV40 were interpreted as evidence for asymmetric mutation pressure because of a polarity switch at the origin of replication (Filipski, 1990). GC and AT skews are measured, for example, as the quantities (C−G)/(G+C) and (A−T)/(A+T) along a DNA sequence using a sliding window. Lobry (1996a) showed the existence of GC and AT skews in the genome of Haemophilus influenzae and in parts of the Escherichia coli and Bacillus subtilis genomes. The disadvantage of this method is that it is indirect, but the increasing number of completely sequenced genomes allows an extensive analysis of the variation in nucleotide composition within and between genomes. Several recent studies used this method to analyse mitochondrial, viral and bacterial genomes (with emphasis on the latter) for compositional asymmetries, revealing systematic deviations from PR2 in these genomes.
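A sliding-window skew calculation of this kind might look like the sketch below. It uses the sign convention given in the text, (C−G)/(G+C) and (A−T)/(A+T); the function name, the default window and step sizes, and the toy sequence are assumptions made for illustration.

```python
def window_skews(seq, window=1000, step=500):
    """Return (start, gc_skew, at_skew) for each window along seq,
    with gc_skew = (C-G)/(C+G) and at_skew = (A-T)/(A+T)."""
    seq = seq.upper()
    results = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        a, c, g, t = (chunk.count(b) for b in "ACGT")
        # Windows lacking the relevant bases are given a skew of zero.
        gc = (c - g) / (c + g) if c + g else 0.0
        at = (a - t) / (a + t) if a + t else 0.0
        results.append((start, gc, at))
    return results


# Toy sequence (invented): a G/T-rich half followed by a C/A-rich half.
toy = "G" * 6 + "C" * 2 + "T" * 6 + "A" * 2 + "C" * 6 + "G" * 2 + "A" * 6 + "T" * 2
for start, gc, at in window_skews(toy, window=16, step=16):
    print(start, gc, at)
```

In the toy sequence both skews change sign between the two windows, which is the kind of pattern sought along a real chromosome.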
In bacteria, the deviations switch sign at the origin and terminus of replication, such that the leading strand of replication is generally richer in G than in C, and in T than in A. Absolute values for AT skews tend to be lower than for GC skews. Grigoriev (1998) measured skews over all bases and found that the leading strand generally contains more G than C. Third codon position skews show a GT-rich leading strand in all eubacteria (Francino and Ochman, 1997, McLean et al., 1998) except in the Mycoplasma species, where it is CT-rich (McLean et al., 1998), and in Synechocystis, where no skew was detected (McLean et al., 1998, Mrázek and Karlin, 1998). Archaebacteria generally show no skew (Karlin, 1998, Mrázek and Karlin, 1998), except for M. thermoautotrophicum, where a weak skew has been detected (McLean et al., 1998, Rocha et al., 1998). Compositional studies of all genome positions in several bacteria report a correlation between purine excess and coding strand excess (Freeman et al., 1998), and an excess of keto bases (GT) over amino bases (AC) in the leading strand (Freeman et al., 1998, Perrière et al., 1996). Rocha et al. (1998) observed compositional asymmetries between leading and lagging strand genes at the level of nucleotides, codons and amino acids. Additionally, a strand compositional asymmetry was confirmed in the complete genomes of B. subtilis (Kunst et al., 1997), E. coli (Blattner et al., 1997), Rickettsia prowazekii (Andersson et al., 1998) and Treponema pallidum (Fraser et al., 1998).
The deviation divides the chromosome into two segments that are homogeneous for GC (AT) skews, called chirochores (Lobry, 1996a) in analogy with isochores (Bernardi, 1989), which are domains of mammalian chromosomes with homogeneous GC content. Chirochores coincide with replichores (Blattner et al., 1997), so that skews switch sign at the origin and terminus of replication. This polarity switch allows for the confirmation of the origin of replication. The method was used to predict the origin in Mycoplasma genitalium (Lobry, 1996b), R. prowazekii (Andersson et al., 1998), T. pallidum (Fraser et al., 1998) and Borrelia burgdorferi (Fraser et al., 1997), where the origin could not be detected (because of a lack of consensus patterns) or had not yet been detected experimentally. There is now experimental evidence that the replication origin is located where it was predicted in B. burgdorferi (Picardeau et al., 1999).
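One way to locate such a polarity switch is to look for the extremes of a cumulative skew curve. The sketch below is an illustration only, not the procedure used in the cited origin predictions: the function names and the toy sequence are hypothetical, and the curve follows the C−G sign convention used above. Which extreme corresponds to the origin depends on the sign convention and on the strand being read; if the leading strand is richer in G than in C, the peak of a cumulative C−G curve would be the candidate origin and the trough the candidate terminus.

```python
def cumulative_skew(seq):
    """Running sum along seq: +1 for each C, -1 for each G."""
    total, curve = 0, []
    for base in seq.upper():
        if base == "C":
            total += 1
        elif base == "G":
            total -= 1
        curve.append(total)
    return curve


def skew_extrema(seq):
    """Return the 0-based indices of the first maximum and first minimum
    of the cumulative C-G curve, i.e. where the local skew changes sign."""
    curve = cumulative_skew(seq)
    peak = max(range(len(curve)), key=curve.__getitem__)
    trough = min(range(len(curve)), key=curve.__getitem__)
    return peak, trough


# Toy sequence (invented): C-rich, then G-rich, then C-rich again.
toy = "CCA" * 10 + "GGT" * 20 + "CCA" * 10
print(skew_extrema(toy))
```

In real genomes the curve is noisy, so the extremes give a candidate region rather than an exact coordinate.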
Perhaps the clearest skews are seen in mitochondria: studies of the nucleotide composition of mitochondrial genomes (Jermiin et al., 1995, Perna and Kocher, 1995, Reyes et al., 1998, Tanaka and Ozawa, 1994) all report patterns of asymmetric substitution.
An early study of the bacteriophage λ genome (Daniels et al., 1983) revealed a base distribution skew in this molecule, but gave no biological interpretation for it. Recently, Mrázek and Karlin (1998) observed asymmetric substitution in some herpesviruses and in the phages λ and T7, and Grigoriev (1998) detected skew in adenovirus type 40.
During revision of this article, several additional publications appeared, showing the existence of strand asymmetries in, for instance, chloroplasts and dsDNA viruses (see Note added in proof).
Generally, there are two ways of looking at the evolutionary changes of nucleotide composition: the selectionist and the neutralist points of view. These hypotheses differ in their estimate of the role of selection in base substitution. The neutralist hypothesis assumes that the average composition of non-coding DNA depends on a bias of selectively neutral mutations that accumulate during evolution. As an example, there are two main theories that explain the origin of isochores. According to the selectionist hypothesis, isochores are the result of positive selection for GC content as an adaptation to the high body temperature of warm-blooded vertebrates (Bernardi et al., 1985). GC content would thereby be the result of positive selection for the functional advantages of the GC content itself. The mutational hypothesis assumes that the compositional biases of mutagenic processes are different in structurally and functionally distinct segments of DNA (Sueoka, 1988, Sueoka, 1992). This hypothesis is based on directional mutation pressure, and must therefore be regarded as neutral rather than selective. Similarly, selective and mutational theories can be developed for the origin of strand-specific nucleotide composition (although the selective hypotheses do not assume that asymmetric substitution is positively selected for because of a functional advantage of the asymmetry itself). Even though the mechanisms creating such patterns are not fully understood, recent publications provide several plausible hypotheses, which have been partly summarised in two recent papers (Francino and Ochman, 1997, Mrázek and Karlin, 1998). The aim of this review is to investigate current hypotheses and classify them as mutational or selective. There is, however, at least one hypothesis that must be regarded as based on both mutation and selection.
Bias on local scale
In organisms in which a large proportion of the genome consists of coding sequence (prokaryotes, mitochondria, chloroplasts and viruses), selective bias acting on a local scale can potentially influence global nucleotide composition. Because control sequences and RNA species that are not translated into proteins make up only a small proportion of these genomes, only protein-coding sequences will be considered here.
Mutational mechanisms
Two important facts strongly suggest that strand asymmetries could be caused by mutational mechanisms. First, the violation of PR2 is pronounced at third codon positions and in intergenic regions (Lobry, 1996a), where selection should be weak or nearly absent. Second, the GC and AT deviations switch sign at the origin and terminus of replication, which suggests a coupling with replication, repair or both.
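Restricting the skew calculation to third codon positions can be sketched as follows. This is a minimal illustration that assumes an in-frame coding sequence starting at its first codon; the function name and the toy sequence are hypothetical, and the sign convention is the (C−G)/(G+C) and (A−T)/(A+T) form used in the Introduction.

```python
def third_position_skews(cds):
    """Return (gc_skew, at_skew) computed on third codon positions only,
    assuming cds is in frame and starts at the first base of a codon."""
    third = cds.upper()[2::3]
    a, c, g, t = (third.count(b) for b in "ACGT")
    gc = (c - g) / (c + g) if c + g else 0.0
    at = (a - t) / (a + t) if a + t else 0.0
    return gc, at


# Toy coding sequence (invented): codons ATG GCG AAG CTT GGG TAA.
toy_cds = "ATGGCGAAGCTTGGGTAA"
print(third_position_skews(toy_cds))
```

Comparing such values between genes on the leading and lagging strands would be one way to examine the first of the two observations above.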
Combination of selection and mutation
There is at least one possible mechanism that involves both selection and mutation. Francino et al. (1996) and Francino and Ochman (1997) suggested that processes which distinguish between the transcribed and non-transcribed strands can account for DNA asymmetry. Transcription alone would not distinguish between the leading and lagging strands, but in combination with biased gene orientation (discussed in Section 2.2.2), transcription-induced mutations could generate the compositional asymmetry between
Discussion
Compositional studies of bacterial, mitochondrial and viral genomes have established the existence of deviations from the frequencies A=T and G=C expected under no-strand-bias conditions. Skew values differ depending on which part of the genome is studied, and different genomes conform differently to the predicted models. Therefore, compositional asymmetry could be the result of a superposition of different mechanisms that influence base composition to different extents, and act differently in
Note added in proof
During the revision of this paper, some additional articles of great interest were published. A study by Grigoriev (1999, Virus Res. 60, 1–19) reveals compositional asymmetry between the leading and lagging strands in 22 complete sequences of dsDNA viruses. Possible contributions of transcription and replication (and their associated repair mechanisms) are discussed along with other potential sources of strand bias. Similarly, in the chloroplast genome of Euglena gracilis (Morton, 1999, Proc. Natl.
References (103)
When polymerases collide: replication and the transcriptional organization of the E. coli chromosome
Cell
(1988)
Complete nucleotide sequence of bacteriophage T7 DNA and the locations of T7 genetic elements
J. Mol. Biol.
(1983)
Archaea and the origin(s) of DNA replication proteins
Cell
(1997)
Strand asymmetries in DNA evolution
Trends Genet.
(1997)
Heterogeneity of DNA repair at the gene level
Mutat. Res.
(1991)
Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes
J. Mol. Biol.
(1981)
Replication error rates for T·dGTP and A·dGTP mispairs and evidence for differential proofreading by leading and lagging strand DNA replication in human cells
J. Biol. Chem.
(1995)
DNA replication fidelity
J. Biol. Chem.
(1992)
On the denaturation of deoxyribonucleic acid. II. Effects of concentration
Biochim. Biophys. Acta
(1967)
Influence of genomic G+C content on average amino-acid composition of proteins from 59 bacterial species
Gene
(1997)
Base mispair extension kinetics. Comparison of DNA polymerase alpha and reverse transcriptase
J. Biol. Chem.
Archaeal genomics: an overview
Cell
Mispair, site-, and strand-specific error rates during simian virus 40 origin-dependent replication in vitro with excess deoxythymine triphosphate
J. Biol. Chem.
Skewed oligomers and origins of replication
Gene
Base selection, proofreading and mismatch repair during DNA replication in Escherichia coli
J. Biol. Chem.
Codon usage and genome evolution
Curr. Opin. Genet. Dev.
Strand asymmetry in human mitochondrial DNA mutations
Genomics
Translation framing code and frame-monitoring mechanism as suggested by the analysis of mRNA and 16 S rRNA nucleotide sequences
J. Mol. Biol.
Discontinuous DNA replication in a lig-7 strain of Escherichia coli is not the result of mismatch repair, nucleotide-excision repair, or the base-excision repair of DNA uracil
Biochem. Biophys. Res. Comm.
Sequence and organization of the human mitochondrial genome
Nature
The genome sequence of Rickettsia prowazekii and the origin of mitochondria
Nature
Genetics and enzymology of DNA replication in Escherichia coli
Annu. Rev. Genet.
Transcription-induced mutations: increase in C to T mutations in the non-transcribed strand during transcription in Escherichia coli
Proc. Natl. Acad. Sci. USA
Correlation between transcription and C to T mutations in the non-transcribed DNA strand
Biol. Chem.
The mosaic genome of warm-blooded vertebrates
Science
The isochore organization of the human genome
Annu. Rev. Genet.
The complete genome sequence of Escherichia coli K-12
Science
Deducing the pattern of arthropod phylogeny from mitochondrial DNA rearrangements
Nature
Strand symmetry of mutation rates in the β-globin region
J. Mol. Evol.
Chemical specificity of nucleic acids and mechanism of their enzymatic degradation
Experientia
Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence
Nature
Restriction of the activity of the recombination site dif to a small zone of the Escherichia coli chromosome
Genes Dev.
Features of bacteriophage lambda: analysis of the complete nucleotide sequence
Association of increased spontaneous mutation rates with high levels of transcription in yeast
Science
Fidelity mechanisms in DNA replication
Annu. Rev. Biochem.
DNA polymerase accuracy and spontaneous mutation rates: frequencies of purine·purine, purine·pyrimidine and pyrimidine·pyrimidine mismatches during DNA replication
Proc. Natl. Acad. Sci. USA
Mutants in the Exo I motif of Escherichia coli dnaQ: defective proofreading and inviability due to error catastrophe
Proc. Natl. Acad. Sci. USA
Unequal fidelity of leading and lagging strand DNA replication on the Escherichia coli chromosome
Proc. Natl. Acad. Sci. USA
Evolution of DNA sequence, contributions of mutational bias and selection to the origin of chromosomal compartments
Whole-genome random sequencing and assembly of Haemophilus influenzae Rd
Science
Asymmetries generated by transcription-coupled repair in enterobacterial genes
Science
G and T nucleotide content show specie invariant negative correlation for all three codon positions
J. Biomol. Struct. Dynam.
The minimal gene complement of Mycoplasma genitalium
Science
Genomic sequence of a Lyme disease spirochaete, Borrelia burgdorferi
Nature
Complete genome sequence of Treponema pallidum, the syphilis spirochete
Science
A sensitive genetic assay for the detection of cytosine deamination: determination of rate constants and the activation energy
Biochemistry
Patterns of genome organization in bacteria
Science
Asymmetrical DNA replication promotes evolution: disparity theory of evolution
Genetica
Codon usage in bacteria: correlation with gene expressivity
Nucleic Acids Res.
Analysing genomes with cumulative skew diagrams
Nucleic Acids Res.