Characterization of angiosperm nrDNA polymorphism, paralogy, and pseudogenes

https://doi.org/10.1016/j.ympev.2003.08.021

Abstract

Many early reports of ITS region (ITS 1, 5.8S, and ITS 2) variation in flowering plants indicated that nrDNA arrays within individuals are homogeneous. However, both older and more recent studies have found intra-individual nrDNA polymorphism across a range of plant taxa including presumed non-hybrid diploids. In addition, polymorphic individuals often contain potentially non-functional nrDNA copies (pseudogenes). These findings suggest that complete concerted evolution should not be assumed when embarking on phylogenetic studies using nrDNA sequences. Here we (1) discuss paralogy in relation to species tree reconstruction and conclude that a priori determinations of orthology and paralogy of nrDNA sequences should not be made based on the functionality or lack of functionality of those sequences; (2) discuss why systematists might be particularly interested in identifying and including pseudogene sequences as a test of gene tree sampling; (3) examine the various definitions and characterizations of nrDNA pseudogenes as well as the relative merits and limitations of a subset of pseudogene detection methods and conclude that nucleotide substitution patterns are particularly appropriate for the identification of putative nrDNA pseudogenes; and (4) present and discuss the advantages of a tree-based approach to identifying pseudogenes based on comparisons of sequence substitution patterns from putatively conserved (e.g., 5.8S) and less constrained (e.g., ITS 1 and ITS 2) regions. Application of this approach, through a method employing bootstrap hypothesis testing, and the issues discussed in the paper are illustrated through reanalysis of two previously published matrices. Given the apparent robustness of the test developed and the ease of carrying out percentile bootstrap hypothesis tests, we urge researchers to employ this statistical tool. 
While our discussion and examples concern the literature on plant systematics, the issues addressed are relevant to studies of nrDNA and other multicopy genes in other taxa.

Introduction

Sequences of the internal transcribed spacer region (ITS 1, 5.8S, and ITS 2) of nuclear ribosomal DNA (nrDNA) have been a staple source of data for the study of lower level phylogenetic relationships among plant taxa for more than ten years (e.g., Baldwin, 1992; Baldwin et al., 1995). In fact, the ITS region is one of the most widely applied molecular markers in current angiosperm systematics (e.g., Hershkovitz et al., 1999; Álvarez and Wendel, 2003). Some early reports of ITS variation provided results that were consistent with the existence of homogeneous nrDNA arrays within individuals (e.g., Ainouche and Bayer, 1997; Baldwin et al., 1995) presumably resulting from concerted evolution (gene conversion and unequal crossing over (Arnheim, 1983)). The effects of complete concerted evolution have been clearly documented between homeologous loci in some allotetraploid taxa of Gossypium (Wendel et al., 1995) and Paeonia (Sang et al., 1995). As a result, intra-individual polymorphism has generally been considered to be the exception rather than the rule for nrDNA (Mayol and Rosselló, 2001). Nevertheless, some studies have identified the occurrence of intra-individual nrDNA polymorphism in a range of taxa including non-hybrid diploids and allopolyploids (e.g., Baker et al., 2000; Buckler et al., 1997; Campbell et al., 1997; Denduangboripant and Cronk, 2000; Doyle et al., 1990; Fuertes-Aguilar et al., 1999; Gaut et al., 2000; Gernandt and Liston, 1999; Hartmann et al., 2001; Hughes et al., 2002; Jobst et al., 1998; Kita and Ito, 2000; Kuzoff et al., 1999; Learn and Schaal, 1987; Linder et al., 2000; Mayol and Rosselló, 2001; Muir et al., 2001; O’Kane et al., 1996; Sang et al., 1995; Suh et al., 1993; Vargas et al., 1999; Widmer and Baltisberger, 1999). 
It has also become clear that polymorphic individuals often contain potentially non-functional nrDNA copies (pseudogenes) in addition to functional copies (e.g., Buckler et al., 1997; Hartmann et al., 2001; Hughes et al., 2002; Kita and Ito, 2000; Mayol and Rosselló, 2001; Muir et al., 2001; Yang et al., 1999). Clearly, complete concerted evolution can no longer be assumed when embarking on studies that use nrDNA sequences (ITS, ETS, 5.8S, 18S, or 26S) for phylogenetic analysis of plant taxa.

The implications and importance of nrDNA polymorphism and pseudogenes for phylogenetic analyses of plant nrDNA sequences were discussed in the groundbreaking work of Buckler et al. (1997) and more recently by Mayol and Rosselló (2001). However, several major issues relating to the characterization of nrDNA polymorphism and pseudogenes, and their implications for species-level phylogeny reconstruction, remain ambiguous. These issues involve how to determine orthology and paralogy of nrDNA sequences, how to define and detect nrDNA pseudogenes, and the desirability of using pseudogenes in studies of phylogenetic relationships. For example, views on the use of pseudogene sequences in phylogenetic analysis of species and higher taxa include both deliberate a priori exclusion (e.g., Yang et al., 1999) and explicit inclusion (e.g., Buckler et al., 1997; Hughes et al., 2002).

Here we review and attempt to clarify the issues surrounding nrDNA polymorphism, pseudogenes, and species tree reconstruction. In the first section, we examine the relationship between intra-individual polymorphism in nrDNA and levels of paralogy, arguing that gene tree reconstruction is essential for understanding the potential complexity of nrDNA evolution and for determining orthology and interspecific paralogy. We distinguish between “shallow paralogs” and “deep paralogs” in order to clarify how nrDNA polymorphism and paralogy can affect the inference of species trees from gene trees. We then illustrate why orthology and paralogy of sequences from different individuals should not be inferred from the potential functionality of those sequences, arguing again that gene tree analyses are essential.

In subsequent sections we focus on nrDNA pseudogenes more closely, particularly noting that the identification of pseudogenes can act as a test of nrDNA sampling and that a priori exclusion of pseudogenes from gene tree analyses is generally unjustified. We discuss definitions of pseudogenes and methods of detecting them that are applicable to nrDNA. We conclude that expression is a poor criterion for identifying nrDNA pseudogenes and that patterns of nucleotide substitution are more appropriate in the context of phylogenetic systematics. We explore the relevance, reliability, and effectiveness of pseudogene detection methods that examine patterns of nucleotide substitution, and we present a formalized tree-based approach. Examples illustrating these issues and emphasizing the importance of sampling are provided through the reanalysis of nrDNA ITS data sets from Lophocereus Britton & Rose (Hartmann et al., 2001) and Brassicaceae (Yang et al., 1999) using a tree-based approach that relies on bootstrap hypothesis testing. Both of these data sets include putatively functional and non-functional copies. We show that the tree-based method has several general advantages: compared with other approaches, it was more powerful at detecting pseudogenes, it revealed complex substitution patterns across gene trees that suggested a much broader range of evolutionary mechanisms, and it was useful for detecting errors such as long branch attraction.
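As an illustration only, the kind of percentile bootstrap hypothesis test referred to above might be sketched as follows. This is not the authors' published procedure. It assumes that, for a single branch of the gene tree, each aligned site has already been scored as changed (1) or unchanged (0), separately for a putatively conserved region (e.g., 5.8S) and a less constrained region (e.g., ITS 1 and ITS 2). The function names, the one-sided form of the test, and the default settings are hypothetical choices made for this sketch. The null hypothesis is the pseudogene one: changes accumulate at the same per-site rate in both regions.

```python
import random


def rate_difference(conserved_changes, spacer_changes):
    """Per-site change rate in the spacer region minus that in the conserved region."""
    spacer_rate = sum(spacer_changes) / len(spacer_changes)
    conserved_rate = sum(conserved_changes) / len(conserved_changes)
    return spacer_rate - conserved_rate


def percentile_bootstrap_test(conserved_changes, spacer_changes,
                              n_reps=2000, alpha=0.05, seed=1):
    """One-sided percentile bootstrap test of the pseudogene null hypothesis.

    Inputs are lists of 0/1 values, one per aligned site, marking whether a
    change was inferred on the branch under test (an assumed encoding).
    Returns (lower_bound, reject_null), where lower_bound is the alpha
    percentile of the bootstrapped rate differences.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    diffs = []
    for _ in range(n_reps):
        # Resample sites with replacement, independently within each region.
        conserved_sample = rng.choices(conserved_changes, k=len(conserved_changes))
        spacer_sample = rng.choices(spacer_changes, k=len(spacer_changes))
        diffs.append(rate_difference(conserved_sample, spacer_sample))
    diffs.sort()
    lower_bound = diffs[int(alpha * n_reps)]
    # Reject equal rates only if the spacer rate is credibly higher,
    # i.e. the conserved region appears to be under constraint.
    return lower_bound, lower_bound > 0
```

Under these assumptions, a branch with very few changes in the conserved region and many in the spacers would reject the null (consistent with functional constraint), whereas a branch with similar rates in both regions would fail to reject it (consistent with a putative pseudogene). For example, `percentile_bootstrap_test([1] * 2 + [0] * 158, [1] * 60 + [0] * 190)` returns a lower bound together with a flag indicating rejection.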

Section snippets

Nuclear rDNA polymorphism, orthology, paralogy, and species trees

The reconstruction of phylogenetic relationships among species and higher taxa (species-trees) using data from multicopy sequences, such as nrDNA, depends upon a clear understanding of sequence relationships (gene trees: e.g., Avise, 1989; Doyle, 1992; Goodman et al., 1979; Pamilo and Nei, 1988; Sanderson and Doyle, 1992). If sequences are mistaken as orthologous (derived from speciation events), when they are in fact paralogous (derived from gene duplication events), relationships among

Orthology, paralogy, and functionality

Our argument in this section is that orthology and paralogy should not be inferred on the basis of sequence functionality. Because pseudogene copies of nrDNA can be easily amplified, researchers are faced with the question of their relationship to the functional copies from which they were likely derived, specifically, whether they represent deep or shallow paralogs. It is easy to assume that pseudogenes will be interspecific paralogs to functional copies, but this need not be the case. Gene

When are nrDNA polymorphisms likely to be of interest?

All known examples of interspecifically maintained nrDNA paralogous polymorphism occur among closely related species or, in a few cases, between genera. To the best of our knowledge, shared polymorphisms in more distantly related taxa have not been directly uncovered in studies of plant nrDNA. For example, there is no evidence of deep maintained paralogs between conifers and Arabidopsis. For pseudogene-related polymorphisms this is hardly surprising given that extensive divergence would

Why are nrDNA pseudogenes of interest?

In a subset of the cases in which intra-individual polymorphisms in nrDNA have been identified, the potential functionality of sequences has also been assessed. These studies suggest that putatively functional and non-functional sequences are both commonly amplified when nrDNA polymorphisms are encountered (e.g., Buckler et al., 1997). Furthermore, the dynamics of amplification (involving copy number, secondary structure, and primer site conservation) can result in the preferential

Definitions of nrDNA pseudogenes

Standard definitions of “pseudogene” are difficult to apply to nrDNA. Consider the following common definitions: “a silent, non-functional DNA sequence,” “non-functional genes related in sequence to functional genes,” and “sequences which resemble the functional genes with which they are associated, but which differ at a number of base pair sites and are not transcribed because they have internal “stop” codons” (all in Futuyma, 1998). These definitions, or aspects of them, are problematic when

Detecting pseudogenes

How to detect nrDNA pseudogenes remains a perplexing issue. This is partly because of problems with how pseudogenes have been defined (above) and partly due to the continuum of change found between obvious functional copies and obvious pseudogenes (i.e., sequences that have lost functional constraint but have not extensively diverged from functional copies).

The methods that have been used to detect pseudogene nrDNA sequences rely on examining attributes of sequences that are presumably

Nucleotide diversification

Patterns of nucleotide substitution have been used as evidence to distinguish between functionally constrained and unconstrained nrDNA sequences (e.g., Buckler and Holtsford, 1996; Buckler et al., 1997; Hershkovitz et al., 1999; Hughes et al., 2002; Mayol and Rosselló, 2001; Yang et al., 1999). In most cases, this approach has relied on pairwise methods to detect degrees of divergence based on the assumption that functional sequences are under strong selective constraints that limit their

Data sets and phylogenetic analysis

Two previously published nrDNA ITS sequence matrices that included putatively functional and non-functional sequences (Hartmann et al. (2001) for Lophocereus (Cactaceae) and Yang et al. (1999) for Brassicaceae) were selected to: (1) provide examples of some of the issues discussed above; (2) illustrate application of a tree-based approach to pseudogene detection; and (3) highlight the critical importance of extensive intra-individual, intraspecific, and interspecific sampling in nrDNA

Lophocereus

The tree-based reanalysis of the available Lophocereus ITS 1 and 5.8S data (Hartmann et al., 2001) failed to reject the pseudogene null hypothesis for the branches subtending both major clades of nrDNA sequences (branches 43, 51, and 52 in Fig. 2A and Table 1) as well as for the one outgroup (Pachycereus) branch tested (branch 1 in Fig. 2A and Table 1). As branches 43 and 51 are descendants of branch 52, they act as tests of the prediction that the descendants of a branch showing the

Conclusions

The existence of intra-individual nrDNA polymorphism raises a series of important issues that need to be considered when reconstructing phylogenetic relationships, particularly among closely related species. Here we have focused on clarifying issues related to nrDNA paralogy and pseudogenes. Previous discussions of nrDNA evolution unjustifiably assumed that pseudogenes are interspecific deep paralogs of functional loci in studies that include more than one species (Buckler et al., 1997; Mayol

Acknowledgements

The authors thank Jeff Doyle, Helga Ochoterena, and Mark Simmons for providing comments prior to manuscript submission as well as Jonathan Wendel for sending a prepublication copy of Álvarez and Wendel (2003). We are grateful to reviewer Eric Roalson and an anonymous reviewer for helpful comments on the manuscript. We are also grateful to another anonymous reviewer for suggesting comparisons with likelihood estimates of branch length and examination of how changes in tree topology might affect

References (79)

  • J.C. Avise. Gene-trees and organismal histories: a phylogenetic approach to population biology. Evolution (1989).
  • Bailey, C.D., Hughes, C.E., Harris, S.A., in press. Using RAPDs to identify DNA sequence loci for species level...
  • B.G. Baldwin et al. The ITS region of nuclear ribosomal DNA: a valuable source of evidence on angiosperm phylogeny. Ann. Mo. Bot. Gard. (1995).
  • E.S.I. Buckler et al. Zea ribosomal repeat evolution and substitution patterns. Mol. Biol. Evol. (1996).
  • E.S.I. Buckler et al. The evolution of ribosomal DNA: divergent paralogues and phylogenetic implications. Genetics (1997).
  • C.S. Campbell et al. Persistent nuclear ribosomal DNA sequence polymorphism in the Amelanchier agamic complex (Rosaceae). Mol. Biol. Evol. (1997).
  • M.R. Chernick. Bootstrap Methods: A Practitioner’s Guide (1999).
  • D. Choi et al. The expression of pseudogene cyclin D2 mRNA in the human ovary may be a novel marker for decreased ovarian function associated with the aging process. J. Assist. Reprod. Gen. (2001).
  • J. Davis et al. Data decisiveness, data quality and incongruence in phylogenetic analysis: an example from the monocotyledons using mitochondrial atpA sequences. Syst. Biol. (1998).
  • J. Denduangboripant et al. High intraindividual variation in internal transcribed spacer sequences in Aeschynanthus (Gesneriaceae) implications for phylogenetics. Proc. R. Soc. Lond. B (2000).
  • J. Dopazo. Estimating errors and confidence intervals for branch lengths in phylogenetic trees by a bootstrap approach. J. Mol. Evol. (1994).
  • J.J. Doyle. Gene trees and species trees: molecular systematics as one-character taxonomy. Syst. Bot. (1992).
  • J.J. Doyle et al. Homology in molecular phylogenetics: a parsimony perspective.
  • J.J. Doyle et al. Analysis of a polyploid complex in Glycine with chloroplast and nuclear DNA. Aust. Syst. Bot. (1990).
  • J. Felsenstein. Confidence limits on phylogenies: an approach using the bootstrap. Evolution (1985).
  • J. Felsenstein. Cases in which parsimony and compatibility methods will be positively misleading. Syst. Zool. (1978).
  • Felsenstein, J., 1993. PHYLIP (Phylogeny Inference Package) version 3.5c. Distributed by the author. Department of...
  • J. Fuertes-Aguilar et al. Nuclear ribosomal DNA (nrDNA) concerted evolution in natural and artificial hybrids of Armeria (Plumbaginaceae). Mol. Ecol. (1999).
  • D.J. Futuyma. Evolutionary Biology (1998).
  • M. Gardiner-Garden et al. Methylation sites in angiosperm genes. J. Mol. Evol. (1992).
  • B.S. Gaut et al. Phylogenetic relationships and genetic diversity among members of the Festuca-Lolium complex (Poaceae) based on ITS sequence data. Plant Syst. Evol. (2000).
  • D.S. Gernandt et al. Internal transcribed spacer region evolution in Larix and Pseudotsuga (Pinaceae). Am. J. Bot. (1999).
  • Goloboff, P., 2000. NONA: a tree searching program. Available from...
  • J. Goodman et al. Fitting the gene lineage into its species lineage: a parsimony strategy illustrated by cladograms constructed from globin sequences. Syst. Zool. (1979).
  • D. Graur et al. Fundamentals of Molecular Evolution (2000).
  • P. Hall et al. Two guidelines for bootstrap hypothesis testing. Biometrics (1991).
  • R.K. Hamby et al. Ribosomal RNA as a phylogenetic tool in plant systematics.
  • S. Hartmann et al. Extensive ribosomal DNA genic variation in the columnar cactus Lophocereus. J. Mol. Evol. (2001).
  • M.A. Hershkovitz et al. Ribosomal DNA sequences and angiosperm systematics.
