Large-scale analysis of pseudogenes in the human genome

https://doi.org/10.1016/j.gde.2004.06.003

Abstract

Pseudogenes are considered genomic fossils: disabled copies of functional genes that were once active in the ancient genome. Recently, whole-genome computational approaches have revealed thousands of pseudogenes in the genomes of human and other eukaryotes. Identifying these pseudogenes can improve the accuracy of gene annotation. It also offers new insight into the evolutionary history and the stability of the genome as a whole.

Introduction

Mammalian genomes, such as those of human and mouse, contain a large number of gene-like sequences called pseudogenes. These pseudogenes are heritable, non-functional gene homologs that are generally disabled at the transcriptional level [1,2••]. In most cases, pseudogenes cannot produce transcripts because they lack a functional promoter. Very rarely, a pseudogene has either retained or acquired a functional promoter and can be transcribed, but these transcripts are not translated because they lack translational or splicing signal sequences. As a result of their non-functionality, pseudogenes are generally released from selective pressure and often accumulate mutations such as frameshifts, in-frame stop codons, or interspersed repeats in the original protein-coding sequence (CDS) (see Figure 1). Consequently, we can identify pseudogenes operationally by finding regions of homology that have these non-gene-like features (Table 1).
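The operational definition above can be illustrated with a minimal Python sketch. It assumes a candidate region has already been aligned, in frame, to the CDS of its parental gene; the function names and the toy sequences are hypothetical and are not taken from any of the surveys cited here. The sketch flags two of the disablements mentioned in the text: in-frame stop codons and frameshifts (gap runs whose length is not a multiple of three).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def premature_stops(seq):
    """Return 0-based codon indices of in-frame stop codons,
    ignoring the last complete codon (the expected terminator)."""
    seq = seq.upper()
    n_codons = len(seq) // 3
    return [i for i in range(n_codons - 1)
            if seq[3 * i:3 * i + 3] in STOP_CODONS]


def frameshift_gaps(aligned_gene, aligned_candidate):
    """Return sorted (start, length) pairs for gap runs ('-') in either
    row of a pairwise alignment whose length is not a multiple of 3."""
    if len(aligned_gene) != len(aligned_candidate):
        raise ValueError("aligned sequences must have equal length")
    shifts = []
    for row in (aligned_gene, aligned_candidate):
        i = 0
        while i < len(row):
            if row[i] == "-":
                j = i
                while j < len(row) and row[j] == "-":
                    j += 1
                if (j - i) % 3 != 0:  # indel disrupts the reading frame
                    shifts.append((i, j - i))
                i = j
            else:
                i += 1
    return sorted(shifts)


# Toy example: the second codon of the candidate is a stop codon.
print(premature_stops("ATGTAAGGGTGA"))
# Toy example: a single-base deletion in the candidate shifts the frame.
print(frameshift_gaps("ATGGCCAAATGA", "ATGG-CAAATGA"))
```

A real pipeline would also need to handle interspersed repeats, partial alignments, and sequencing errors, which this sketch deliberately ignores.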

Depending on the mechanism by which they were generated, the majority of mammalian pseudogenes can be divided into duplicated pseudogenes and retrotransposed pseudogenes (also called processed pseudogenes). Duplicated pseudogenes arose from tandem duplication or unequal crossing-over; thus, they have often retained the original exon–intron structures of the parental genes, although sometimes incompletely. By contrast, retrotransposed pseudogenes were created by retrotransposition: the reverse transcription of the mRNA transcript followed by integration into the genome [3,4]. Therefore, retrotransposed pseudogenes are often considered a special type of retrotransposon, just like long interspersed nuclear elements (LINEs) and short interspersed nuclear elements (SINEs) in mammalian genomes [5]. Retrotransposed pseudogenes also share some of the common characteristics of LINEs and SINEs, which include a complete lack of introns, the presence of small flanking direct repeats, and a polyadenine tail near the 3′-end. Because of their close homology to functional genes, pseudogenes often introduce errors or contamination into the sequence databases (Figure 1). In addition to retrotransposed and duplicated pseudogenes, other types of pseudogenes also exist in the human genome (see below).
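The distinction between the two main classes can be sketched as a simple rule. The following Python example is only an illustration under stated assumptions: the function names, the `min_run` threshold of 12 adenines, and the three output labels are hypothetical choices, not criteria used by the surveys cited in this review. It treats any retained intron as evidence of duplication and a polyadenine tract in the downstream flank as evidence of retrotransposition.

```python
import re


def has_poly_a(downstream_flank, min_run=12):
    """True if the flank contains a run of at least min_run adenines.
    The threshold is an illustrative assumption."""
    pattern = "A{%d,}" % min_run
    return re.search(pattern, downstream_flank.upper()) is not None


def classify_pseudogene(n_introns_retained, downstream_flank):
    """Rough classification from two of the features named in the text:
    retained introns and a polyadenine tail near the 3' end."""
    if n_introns_retained > 0:
        return "duplicated"
    if has_poly_a(downstream_flank):
        return "retrotransposed"
    return "ambiguous"  # intronless but no detectable poly-A tail


print(classify_pseudogene(0, "GCT" + "A" * 15 + "GG"))
print(classify_pseudogene(3, "GCTAGCTAGC"))
```

The "ambiguous" label matters in practice: an intronless candidate could be an old retrotransposed copy whose tail has decayed, or a duplicate of a single-exon gene, and this rule alone cannot tell them apart.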

Over the years, pseudogenes have been comprehensively surveyed in several completely sequenced genomes (Table 2). In 2002, a preliminary survey reported ∼400 pseudogenes on the two smallest human chromosomes, 21 and 22 [6]. Several other studies have focused on the pseudogene population of selected gene families [7•,8,9,10]. The year 2003 proved to be an exciting one, as three groups independently published comprehensive surveys of pseudogenes in the entire human genome [11••,12••,13••]. In the same year, it was also discovered that a mouse pseudogene actually has a regulatory role [14].

Section snippets

Whole-genome identification of pseudogenes

Traditionally, pseudogenes were often discovered as by-products of gene sequencing or PCR experiments. Only after the whole-genome sequencing projects were large numbers of pseudogenes identified and annotated. Using a homology-based approach, Zhang et al. [12••] identified ∼8000 retrotransposed pseudogenes and ∼3000 duplicated pseudogenes in the human genome draft (Build 28, April 2002 release). Ohshima et al. [11••] used essentially the same approach in their survey except that they

Exact number of the pseudogenes in the human genome

It is a little surprising that the total numbers of human pseudogenes reported by the three research groups are quite different. Much of the discrepancy can be attributed to the different criteria used by individual groups. Ohshima et al. [11••] applied the most stringent criteria in their procedures as they only presented those pseudogenes that are 90% complete in comparison with their parental genes. Zhang et al. [12••] counted those candidates that are 70% complete in coding region as

Pseudogenes in other organisms

In addition to human, large numbers of pseudogenes have also been identified in the genomes of other eukaryotes, including the nematode worm [17], budding yeast [18], puffer fish [19], and fruitfly [20]. Some prokaryotic genomes also reportedly have many pseudogenes [21,22,23]. Generally, pseudogenes are less common in prokaryotes because their genomes are more compact and have higher DNA deletion rates [24].

The initial annotation of the mouse genome reported ∼14,000 putative pseudogenes [25]. A

Retrotransposed pseudogenes are special types of retrotransposons

The human genome contains several million copies of LINE and SINE elements, which together comprise >30% of the entire human genomic DNA [5]. Whereas LINEs are autonomous (i.e. they can retrotranspose their own transcripts), SINEs have to rely on active LINEs to propagate. It is believed that LINE retrotransposons are also responsible for mobilizing mRNA transcripts and generating retrotransposed pseudogenes [4]. Macroscopically, the distribution of the retrotransposed pseudogenes in the human genome is

Highly expressed genes tend to have multiple retrotransposed pseudogenes

The number of retrotransposed pseudogenes per gene is highly uneven in the human genome. In fact, only 10% of human genes have at least one retrotransposed pseudogene identified [11••,12••]. Ribosomal proteins, which are encoded by 79 genes in the human genome, account for nearly 20% of the entire retrotransposed pseudogene population [7]. Other genes that have multiple retrotransposed pseudogenes include housekeeping genes and genes that code for structural proteins and metabolic enzymes. In general, the

Pseudogenes as tools to study gene and genome evolution

Pseudogenes are often considered ‘genomic fossils’ because they provide glimpses of genes that were active millions of years ago. They can be analyzed to infer the evolutionary history of particular genes or gene families. By comparing the sequences of human cytochrome c (cyc) pseudogenes with the functional cyc gene from human and mouse, it became obvious that accelerated evolution in cyc had occurred in the primate lineage leading to human [31]. In another case, it was found that the

Some pseudogenes are transcribed

Because pseudogenes have high sequence similarity to their parental genes, they can potentially introduce contamination in hybridization or amplification experiments. Special care needs to be taken to prevent such interference [40]. It has been reported that a cytokeratin-19 pseudogene may have interfered with diagnostic assays used to detect micrometastatic tumor cells [41]. In another instance, a novel pseudogene of phox, a component of the phagocyte NADPH oxidase complex, complicates the

Potential functional roles of pseudogenes

Because of their close similarity to the functional genes and high level of sequence conservation, pseudogenes, especially those that are transcribed, have been hypothesized to have regulatory roles [46]. Korneev, Park and O’Shea [47] have reported that, in the neurons of the mollusk Lymnaea stagnalis, a transcribed pseudogene of neural nitric oxide synthase (nNOS) suppresses the synthesis of nNOS protein through an RNAi-like mechanism. The transcript of the pseudogene contains a region with

Conclusions

Pseudogenes are ubiquitous and abundant in mammalian genomes. Their importance and implications have captured the interest of researchers from very diverse disciplines. The fact that some pseudogenes have regulatory roles further demonstrates that these sequences should not be treated as ‘junk DNA’. With more mammalian genomes, such as that of chimpanzee, being sequenced, a more complete picture of pseudogenes and their functions is starting to emerge.

References and recommended reading

Papers of particular interest, published within the annual period of review, have been highlighted as:

  • of special interest

  •• of outstanding interest

Acknowledgements

M Gerstein acknowledges financial support from the National Institutes of Health (NP50 HG02357–01). Z Zhang acknowledges Paul Harrison and Duncan Milburn for helpful discussions.

References (57)

  • J. Zhang et al. The third myrosinase gene TGG3 in Arabidopsis thaliana is a pseudogene specifically expressed in stamen and petal. Physiol Plant (2002)
  • E. Vargas-Madrazo et al. Structural repertoire in VH pseudogenes of immunoglobulins: comparison with human germline genes and human amino acid sequences. J Mol Biol (1995)
  • N. Echols et al. Comprehensive analysis of amino acid and nucleotide composition in eukaryotic genomes, comparing genes and pseudogenes. Nucleic Acids Res (2002)
  • A. Mounsey et al. Evidence suggesting that a fifth of annotated Caenorhabditis elegans genes may be pseudogenes. Genome Res (2002)
  • E.S. Balakirev et al. Pseudogenes: are they “junk” or functional DNA? Annu Rev Genet (2003)
  • J. Maestre et al. mRNA retroposition in human cells: processed pseudogene formation. EMBO J (1995)
  • C. Esnault et al. Human LINE retrotransposons generate processed pseudogenes. Nat Genet (2000)
  • P.L. Deininger et al. Mobile elements and mammalian genome evolution. Curr Opin Genet Dev (2003)
  • Z. Zhang et al. Identification and analysis of over 2000 ribosomal protein pseudogenes in the human genome. Genome Res (2002)
  • G. Glusman et al. The complete human olfactory subgenome. Genome Res (2001)
  • Y. Tourmen et al. Structure and chromosomal distribution of human mitochondrial pseudogenes. Genomics (2002)
  • M. Woischnik et al. Pattern of organization of human mitochondrial pseudogenes in the nuclear genome. Genome Res (2002)
  • K. Ohshima et al. Whole-genome screening indicates a possible burst of formation of processed pseudogenes and Alu repeats by particular L1 subfamilies in ancestral primates. Genome Biol (2003)
  • Z. Zhang et al. Millions of years of evolution preserved: a comprehensive catalog of the processed pseudogenes in the human genome. Genome Res (2003)
  • S. Hirotsune et al. An expressed pseudogene regulates the messenger-RNA stability of its homologous coding gene. Nature (2003)
  • L.D. Hurst. The Ka/Ks ratio: diagnosing the form of sequence evolution. Trends Genet (2002)
  • International Human Genome Sequencing Consortium: Initial sequencing and analysis of the human genome. Nature 2001,...
  • P.M. Harrison et al. Digging for dead genes: an analysis of the characteristics of the pseudogene population in the Caenorhabditis elegans genome. Nucleic Acids Res (2001)