Pash: Efficient Genome-Scale Sequence Anchoring by Positional Hashing

  1. Ken J. Kalafus (1,2),
  2. Andrew R. Jackson (2), and
  3. Aleksandar Milosavljevic (1,2,3,4)

  (1) Program in Structural and Computational Biology and Molecular Biophysics, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, 77030, USA
  (2) Bioinformatics Research Laboratory, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, 77030, USA
  (3) Human Genome Sequencing Center, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, 77030, USA

Abstract

Pash is a computer program for efficient, parallel, all-against-all comparison of very long DNA sequences. Pash implements Positional Hashing, a novel parallelizable method for sequence comparison based on k-mer representation of sequences. The Positional Hashing method breaks the comparison problem in a unique way that avoids the quadratic penalty encountered with other sensitive methods and confers inherent low-level parallelism. Furthermore, Positional Hashing allows one to readily and predictably trade between sensitivity and speed. In a simulated comparison task, anchoring computationally mutated reads onto a genome, the sensitivity of Pash was equal to or greater than that of BLAST and BLAT, with Pash outperforming these programs as the reads became shorter and less similar to the genome. Using modest computing resources, we employed Pash for two large-scale sequence comparison tasks: comparison of three mammalian genomes, and anchoring millions of chimpanzee whole-genome shotgun sequencing reads onto the human genome. The results of these comparisons by Pash agree with those computed by other methods that use more than an order of magnitude more computing resources. These results confirm the sensitivity of Positional Hashing.
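To make the idea of k-mer based anchoring concrete, the following is a minimal sketch of generic hash-based seed matching with diagonal voting. It is not Pash's Positional Hashing scheme, whose details the abstract does not give; the function names (`kmers`, `build_index`, `anchor_read`), the toy sequences, and the choice of k = 4 are all assumptions made for illustration.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield (position, k-mer) for every length-k window of seq."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


def build_index(genome, k):
    """Map each k-mer to the list of positions where it occurs in genome."""
    index = defaultdict(list)
    for pos, word in kmers(genome, k):
        index[word].append(pos)
    return index


def anchor_read(read, index, k):
    """Return (offset, votes) for the diagonal sharing the most k-mers.

    A diagonal is genome_position - read_position; k-mers from a true
    placement of the read all fall on the same diagonal.
    """
    votes = defaultdict(int)
    for rpos, word in kmers(read, k):
        for gpos in index.get(word, ()):
            votes[gpos - rpos] += 1
    if not votes:
        return None
    best = max(votes, key=votes.get)
    return best, votes[best]


# Hypothetical toy data: the read copies genome[9:23] with one substitution.
genome = "ACGTTGCAAGCTTAGGCTAACGGATCCGTA"
read = "GCTTAGGCTTACGG"
print(anchor_read(read, build_index(genome, 4), 4))  # (9, 7)
```

In this toy case the single substitution destroys four of the eleven read k-mers, yet the remaining seven still agree on offset 9. Shorter k tolerates more divergence at the cost of more spurious hits, which is the kind of sensitivity versus speed trade-off the abstract refers to.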

Footnotes

  • Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.1963804.

  • 4 Corresponding author. E-MAIL amilosav{at}bcm.tmc.edu; FAX (713) 798-4373.

  • Received September 10, 2003.

  • Accepted December 27, 2003.