Identification and Characterization of Multi-Species Conserved Sequences

  1. Elliott H. Margulies¹,
  2. Mathieu Blanchette³,
  3. NISC Comparative Sequencing Program¹,²,
  4. David Haussler³,⁴,⁵, and
  5. Eric D. Green¹,²,⁵
  1. Genome Technology Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
  2. NIH Intramural Sequencing Center (NISC), National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
  3. Center for Biomolecular Science and Engineering, University of California, Santa Cruz, California 95064, USA
  4. Howard Hughes Medical Institute, University of California, Santa Cruz, California 95064, USA

Abstract

Comparative sequence analysis has become an essential component of studies aiming to elucidate genome function. The increasing availability of genomic sequences from multiple vertebrates is creating the need for computational methods that can detect highly conserved regions in a robust fashion. Towards that end, we are developing approaches for identifying sequences that are conserved across multiple species; we call these “Multi-species Conserved Sequences” (or MCSs). Here we report two strategies for MCS identification, demonstrating their ability to detect virtually all known actively conserved sequences (specifically, coding sequences) but very little neutrally evolving sequence (specifically, ancestral repeats). Importantly, we find that a substantial fraction of the bases within MCSs (∼70%) resides within non-coding regions; thus, the majority of sequences conserved across multiple vertebrate species has no known function. Initial characterization of these MCSs has revealed sequences that correspond to clusters of transcription factor-binding sites, non-coding RNA transcripts, and other candidate functional elements. Finally, the ability to detect MCSs represents a valuable metric for assessing the relative contribution of a species' sequence to identifying genomic regions of interest, and our results indicate that the currently available genome sequences are insufficient for the comprehensive identification of MCSs in the human genome.

Footnotes

  • Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.1602203.

  • 5 Corresponding authors. E-MAIL egreen{at}nhgri.nih.gov; FAX (301)402-2040. E-MAIL haussler{at}cse.ucsc.edu; FAX (831)459-4829.

  • Received May 29, 2003.

  • Accepted September 5, 2003.