Characterization of angiosperm nrDNA polymorphism, paralogy, and pseudogenes
Introduction
Sequences of the internal transcribed spacer region (ITS 1, 5.8S, and ITS 2) of nuclear ribosomal DNA (nrDNA) have been a staple source of data for the study of lower-level phylogenetic relationships among plant taxa for more than ten years (e.g., Baldwin, 1992; Baldwin et al., 1995). In fact, the ITS region is one of the most widely applied molecular markers in current angiosperm systematics (e.g., Hershkovitz et al., 1999; Álvarez and Wendel, 2003). Some early reports of ITS variation provided results that were consistent with the existence of homogeneous nrDNA arrays within individuals (e.g., Ainouche and Bayer, 1997; Baldwin et al., 1995), presumably resulting from concerted evolution (gene conversion and unequal crossing over; Arnheim, 1983). The effects of complete concerted evolution have been clearly documented between homeologous loci in some allotetraploid taxa of Gossypium (Wendel et al., 1995) and Paeonia (Sang et al., 1995). As a result, intra-individual polymorphism has generally been considered to be the exception rather than the rule for nrDNA (Mayol and Rosselló, 2001). Nevertheless, some studies have identified the occurrence of intra-individual nrDNA polymorphism in a range of taxa including non-hybrid diploids and allopolyploids (e.g., Baker et al., 2000; Buckler et al., 1997; Campbell et al., 1997; Denduangboripant and Cronk, 2000; Doyle et al., 1990; Fuertes-Aguilar et al., 1999; Gaut et al., 2000; Gernandt and Liston, 1999; Hartmann et al., 2001; Hughes et al., 2002; Jobst et al., 1998; Kita and Ito, 2000; Kuzoff et al., 1999; Learn and Schaal, 1987; Linder et al., 2000; Mayol and Rosselló, 2001; Muir et al., 2001; O’Kane et al., 1996; Sang et al., 1995; Suh et al., 1993; Vargas et al., 1999; Widmer and Baltisberger, 1999).
It has also become clear that polymorphic individuals often contain potentially non-functional nrDNA copies (pseudogenes) in addition to functional copies (e.g., Buckler et al., 1997; Hartmann et al., 2001; Hughes et al., 2002; Kita and Ito, 2000; Mayol and Rosselló, 2001; Muir et al., 2001; Yang et al., 1999). Clearly, complete concerted evolution can no longer be assumed when embarking on studies utilizing nrDNA sequences (ITS, ETS, 5.8S, 18S, or 26S) for phylogenetic analysis of plant taxa.
The implications and importance of nrDNA polymorphism and pseudogenes for phylogenetic analyses of plant nrDNA sequences were discussed in the groundbreaking work of Buckler et al. (1997) and more recently by Mayol and Rosselló (2001). However, several major issues relating to the characterization of nrDNA polymorphism and pseudogenes, and their implications for species-level phylogeny reconstruction, remain ambiguous. These issues involve how to determine orthology and paralogy of nrDNA sequences, how to define and detect nrDNA pseudogenes, and the desirability of using pseudogenes in studies of phylogenetic relationships. For example, views on the use of pseudogene sequences in phylogenetic analysis of species and higher taxa include both deliberate a priori exclusion (e.g., Yang et al., 1999) and explicit inclusion (e.g., Buckler et al., 1997; Hughes et al., 2002).
Here we review and attempt to clarify the issues surrounding nrDNA polymorphism, pseudogenes, and species tree reconstruction. In the first section, we examine the relationship between intra-individual polymorphism in nrDNA and levels of paralogy, arguing that gene tree reconstruction is essential for understanding the potential complexity of nrDNA evolution and for determining orthology and interspecific paralogy. We distinguish between “shallow paralogs” and “deep paralogs” in order to clarify how nrDNA polymorphism and paralogy can affect the inference of species trees from gene trees. We then illustrate why orthology and paralogy of sequences from different individuals should not be inferred from the potential functionality of those sequences, arguing again that gene tree analyses are essential.
In subsequent sections we focus on nrDNA pseudogenes more closely, particularly noting that the identification of pseudogenes can act as a test of nrDNA sampling and that a priori exclusion of pseudogenes from gene tree analyses is generally unjustified. Definitions of pseudogenes and methods of detecting pseudogenes applicable to nrDNA are discussed. We conclude that expression is a poor criterion for identifying nrDNA pseudogenes and that patterns of nucleotide substitution are more appropriate in the context of phylogenetic systematics. We explore the relevance, reliability, and effectiveness of pseudogene detection methods that examine patterns of nucleotide substitution and present a formalized tree-based approach. Examples illustrating these issues and emphasizing the importance of sampling are provided through the reanalysis of nrDNA ITS data sets from Lophocereus Britton & Rose (Hartmann et al., 2001) and Brassicaceae (Yang et al., 1999) using a tree-based approach that relies on bootstrap hypothesis testing. Both of these data sets include putatively functional and non-functional copies. We show that the tree-based method has a number of general advantages. Compared to other approaches, it was more powerful at detecting pseudogenes, it revealed complex substitution patterns across gene trees that suggested a much broader range of evolutionary mechanisms, and it was useful for detecting errors such as long branch attraction.
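To make the idea of a bootstrap test on a single branch more concrete, the following is a minimal sketch and not the procedure used in the reanalyses. It assumes that the changes inferred on one branch have already been scored per alignment site (1 = change, 0 = no change) and that each site is labelled as belonging to a conserved region or not. The function names, the input encoding, and the one-sided statistic are all illustrative assumptions. Under the pseudogene null hypothesis, conserved and non-conserved sites are expected to change at similar rates, so a small returned proportion would count as evidence against that null.

```python
import random


def rate_difference(changes, conserved):
    """Rate of change at non-conserved sites minus the rate at conserved sites.

    Returns None when either class of site is absent from the sample.
    """
    cons = [c for c, is_cons in zip(changes, conserved) if is_cons]
    free = [c for c, is_cons in zip(changes, conserved) if not is_cons]
    if not cons or not free:
        return None
    return sum(free) / len(free) - sum(cons) / len(cons)


def bootstrap_constraint_test(changes, conserved, replicates=1000, seed=0):
    """Proportion of bootstrap replicates in which conserved sites change
    at least as fast as non-conserved sites (hypothetical test statistic).

    Sites are resampled with replacement; replicates missing one of the
    two site classes are skipped.
    """
    if len(changes) != len(conserved):
        raise ValueError("changes and conserved must have the same length")
    rng = random.Random(seed)
    n = len(changes)
    not_slower = 0
    used = 0
    for _ in range(replicates):
        idx = [rng.randrange(n) for _ in range(n)]
        diff = rate_difference([changes[i] for i in idx],
                               [conserved[i] for i in idx])
        if diff is None:
            continue
        used += 1
        if diff <= 0:
            not_slower += 1
    if used == 0:
        raise ValueError("no usable bootstrap replicates")
    return not_slower / used


# Toy branch: every spacer site changed, no conserved site changed.
toy_changes = [1] * 10 + [0] * 10
toy_conserved = [False] * 10 + [True] * 10
print(bootstrap_constraint_test(toy_changes, toy_conserved))
```

In this toy case the conserved sites never change, so the pseudogene null would be rejected for the branch. Real data would require changes reconstructed on a gene tree, which this sketch does not attempt.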
Nuclear rDNA polymorphism, orthology, paralogy, and species trees
The reconstruction of phylogenetic relationships among species and higher taxa (species-trees) using data from multicopy sequences, such as nrDNA, depends upon a clear understanding of sequence relationships (gene trees: e.g., Avise, 1989; Doyle, 1992; Goodman et al., 1979; Pamilo and Nei, 1988; Sanderson and Doyle, 1992). If sequences are mistaken as orthologous (derived from speciation events), when they are in fact paralogous (derived from gene duplication events), relationships among
Orthology, paralogy, and functionality
Our argument in this section is that orthology and paralogy should not be inferred on the basis of sequence functionality. Because pseudogene copies of nrDNA can be easily amplified, researchers are faced with the question of their relationship to the functional copies from which they were likely derived, specifically, whether they represent deep or shallow paralogs. It is easy to assume that pseudogenes will be interspecific paralogs to functional copies, but this need not be the case. Gene
When are nrDNA polymorphisms likely to be of interest?
All known examples of interspecifically maintained nrDNA paralogous polymorphism occur among closely related species or, in a few cases, between genera. To the best of our knowledge, shared polymorphisms in more distantly related taxa have not been directly uncovered in studies of plant nrDNA. For example, there is no evidence of deep maintained paralogs between conifers and Arabidopsis. For pseudogene related polymorphisms this is hardly surprising given that extensive divergence would
Why are nrDNA pseudogenes of interest?
In a subset of the cases in which intra-individual polymorphisms in nrDNA have been identified, the potential functionality of sequences has also been assessed. These studies suggest that putatively functional and non-functional sequences are both commonly amplified when nrDNA polymorphisms are encountered (e.g., Buckler et al., 1997). Furthermore, the dynamics of amplification (involving copy number, secondary structure, and primer site conservation) can result in the preferential
Definitions of nrDNA pseudogenes
Standard definitions of “pseudogene” are difficult to apply to nrDNA. Consider the following common definitions: “a silent, non-functional DNA sequence,” “non-functional genes related in sequence to functional genes,” and “sequences which resemble the functional genes with which they are associated, but which differ at a number of base pair sites and are not transcribed because they have internal ‘stop’ codons” (all in Futuyma, 1998). These definitions, or aspects of them, are problematic when
Detecting pseudogenes
How to detect nrDNA pseudogenes remains a perplexing issue. This is partly because of problems with how pseudogenes have been defined (above) and partly due to the continuum of change found between obvious functional copies and obvious pseudogenes (i.e., sequences that have lost functional constraint but have not extensively diverged from functional copies).
The methods that have been used to detect pseudogene nrDNA sequences rely on examining attributes of sequences that are presumably
Nucleotide diversification
Patterns of nucleotide substitution have been used as evidence to distinguish between functionally constrained and unconstrained nrDNA sequences (e.g., Buckler and Holtsford, 1996; Buckler et al., 1997; Hershkovitz et al., 1999; Hughes et al., 2002; Mayol and Rosselló, 2001; Yang et al., 1999). In most cases, this approach has relied on pairwise methods to detect degrees of divergence based on the assumption that functional sequences are under strong selective constraints that limit their
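The pairwise comparisons referred to in this section can be illustrated with a small sketch. It computes the uncorrected proportion of differing sites (p-distance) between two aligned sequences, separately for each region of the alignment. The sequences, the region boundaries, and the function names are hypothetical and chosen only for illustration. A real analysis would use an actual alignment and, typically, a model-corrected distance.

```python
def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Columns containing a gap or an ambiguity code in either sequence
    are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in valid and b in valid:
            compared += 1
            if a != b:
                differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def regional_divergence(seq_a, seq_b, regions):
    """p-distance per region.

    regions maps a region name to (start, end) alignment columns,
    half-open and zero-based (an assumed convention).
    """
    return {name: p_distance(seq_a[start:end], seq_b[start:end])
            for name, (start, end) in regions.items()}


# Toy aligned sequences and made-up region boundaries.
copy_one = "ACGTACGTACGGCTAGCTAGCTAG"
copy_two = "ACTTATGTACGGCTAGTTAGCTAA"
toy_regions = {"ITS1": (0, 12), "5.8S": (12, 24)}
print(regional_divergence(copy_one, copy_two, toy_regions))
```

Comparing the distance in a normally conserved region such as 5.8S with the distance in a spacer is one simple way to ask whether a copy looks unconstrained. As a pairwise measure, it cannot assign changes to particular branches of a gene tree.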
Data sets and phylogenetic analysis
Two previously published nrDNA ITS sequence matrices that included putatively functional and non-functional sequences (Hartmann et al. (2001) for Lophocereus (Cactaceae) and Yang et al. (1999) for Brassicaceae) were selected to: (1) provide examples of some of the issues discussed above; (2) illustrate application of a tree-based approach to pseudogene detection; and (3) highlight the critical importance of extensive intra-individual, intraspecific, and interspecific sampling in nrDNA
Lophocereus
The tree-based reanalysis of the available Lophocereus ITS 1 and 5.8S data (Hartmann et al., 2001) failed to reject the pseudogene null hypothesis for the branches subtending both major clades of nrDNA sequences (branches 43, 51, and 52 in Fig. 2A and Table 1) as well as for the one outgroup (Pachycereus) branch tested (branch 1 in Fig. 2A and Table 1). As branches 43 and 51 are descendants of branch 52, they act as tests of the prediction that the descendants of a branch showing the
Conclusions
The existence of intra-individual nrDNA polymorphism raises a series of important issues that need to be considered when reconstructing phylogenetic relationships, particularly among closely related species. Here we have focused on clarifying issues related to nrDNA paralogy and pseudogenes. Previous discussions of nrDNA evolution unjustifiably assumed that pseudogenes are interspecific deep paralogs of functional loci in studies that include more than one species (Buckler et al., 1997; Mayol
Acknowledgements
The authors thank Jeff Doyle, Helga Ochoterena, and Mark Simmons for providing comments prior to manuscript submission as well as Jonathan Wendel for sending a prepublication copy of Álvarez and Wendel (2003). We are grateful to reviewer Eric Roalson and an anonymous reviewer for helpful comments on the manuscript. We are also grateful to another anonymous reviewer for suggesting comparisons with likelihood estimates of branch length and examination of how changes in tree topology might affect
References
- et al. Ribosomal ITS sequences and plant phylogenetic inference. Mol. Phylogenet. Evol. (2003)
- et al. Molecular phylogenetics of subfamily Calamoideae (Palmae) based on nrDNA ITS and cpDNA rps16 intron sequence data. Mol. Phylogenet. Evol. (2000)
- Phylogenetic utility of the internal transcribed spacers of nuclear ribosomal DNA in plants: an example from the Compositae. Mol. Phylogenet. Evol. (1992)
- et al. The complete external transcribed spacer of 18S–26S rDNA: amplification and phylogenetic utility at low taxonomic levels in Asteraceae and closely allied families. Mol. Phylogenet. Evol. (2000)
- et al. Why nuclear ribosomal DNA spacers (ITS) tell different stories in Quercus. Mol. Phylogenet. Evol. (2001)
- et al. A test for a difference between spectral peak frequencies. Comput. Stat. Data Anal. (1999)
- et al. Noise. Cladistics (1999)
- et al. Molecular phylogenetic studies of Brassica, Rorippa, Arabidopsis and allied genera based on the internal transcribed spacer region of 18S–25S rDNA. Mol. Phylogenet. Evol. (1999)
- et al. On the origins of the tetraploid Bromus species: insights from ITS sequences of nrDNA. Genome (1997)
- Concerted evolution in multigene families