A SNP discovery method to assess variant allele probability from next-generation resequencing data

Yufeng Shen; Zhengzheng Wan; Cristian Coarfa; Rafal Drabek; Lei Chen; Elizabeth A. Ostrowski; Yue Liu; George M. Weinstock; David A. Wheeler; Richard A. Gibbs; Fuli Yu

doi:10.1101/gr.096388.109

¹ The Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA;
² Graduate Program of Structural and Computational Biology and Molecular Biophysics, Baylor College of Medicine, Houston, Texas 77030, USA;
³ Department of Ecology and Evolutionary Biology, Rice University, Houston, Texas 77005, USA;
⁴ The Genome Center, Washington University, St. Louis, Missouri 63108, USA

⁵ Present address: Center for Computational Biology and Bioinformatics, and Department of Computer Science, Columbia University, 1130 St. Nicholas Avenue, New York, NY 10032, USA.

⁶ These authors contributed equally to this work.

Abstract

Accurate identification of genetic variants from next-generation sequencing (NGS) data is essential for immediate large-scale genomic endeavors such as the 1000 Genomes Project, and is crucial for further genetic analysis based on the discoveries. The key challenge in single nucleotide polymorphism (SNP) discovery is to distinguish true individual variants (occurring at a low frequency) from sequencing errors (often occurring at frequencies orders of magnitude higher). Therefore, knowledge of the error probabilities of base calls is essential. We have developed Atlas-SNP2, a computational tool that detects and accounts for systematic sequencing errors caused by context-related variables in a logistic regression model learned from training data sets. Subsequently, it estimates the posterior error probability for each substitution through a Bayesian formula that integrates prior knowledge of the overall sequencing error probability and the estimated SNP rate with the results from the logistic regression model for the given substitutions. The estimated posterior SNP probability can be used to distinguish true SNPs from sequencing errors. Validation results show that Atlas-SNP2 achieves a false-positive rate below 10% with a false-negative rate of ∼5% or lower.
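
As a rough illustration of the two-step idea described above (a logistic regression score followed by a Bayesian update with prior rates), the following Python sketch may help. It is not the formula used by Atlas-SNP2, which the abstract does not spell out. The function names, the coefficients, the training class balance, and the prior values are all hypothetical, and the conversion of the score into a likelihood ratio is a standard prior-shift correction assumed here for illustration only.

```python
import math


def logistic(z: float) -> float:
    """Standard logistic function mapping a linear predictor to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def posterior_snp_probability(
    score: float,
    train_snp_fraction: float,
    prior_snp: float,
    prior_error: float,
) -> float:
    """Combine a logistic-regression score with prior rates via Bayes' rule.

    score              -- hypothetical model output, read as P(true variant | context)
                          under the class balance of the training data; must be in (0, 1)
    train_snp_fraction -- assumed fraction of true variants in the training data
    prior_snp          -- assumed per-base SNP rate
    prior_error        -- assumed per-base sequencing error rate
    """
    # Dividing the score odds by the training odds leaves the likelihood ratio
    # P(context | SNP) / P(context | error) implied by the model.
    score_odds = score / (1.0 - score)
    train_odds = train_snp_fraction / (1.0 - train_snp_fraction)
    likelihood_ratio = score_odds / train_odds

    # Bayes' rule over the two explanations of an observed substitution.
    numerator = likelihood_ratio * prior_snp
    return numerator / (numerator + prior_error)


# Hypothetical context variables and coefficients (illustrative values only).
intercept, coef_quality, coef_position = -4.0, 0.2, -0.01
base_quality, distance_to_read_end = 35.0, 12.0
score = logistic(intercept + coef_quality * base_quality
                 + coef_position * distance_to_read_end)

posterior = posterior_snp_probability(
    score, train_snp_fraction=0.5, prior_snp=0.001, prior_error=0.01
)
print(f"score={score:.3f}, posterior SNP probability={posterior:.3f}")
```

The sketch shows why the priors matter: when errors are assumed to be ten times more common than SNPs, a substitution needs a score well above 0.5 before its posterior SNP probability exceeds one half.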

Footnotes

⁷ Corresponding authors.

E-mail yshen{at}c2b2.columbia.edu; fax (212) 851-5149.

E-mail fyu{at}bcm.edu; fax (713) 798-5741.
[Supplemental material is available online at http://www.genome.org. Atlas-SNP2 and its documentation are available for download at http://www.hgsc.bcm.tmc.edu/cascade-tech-software-ti.hgsc.]
Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.096388.109.
- Received May 21, 2009.
- Accepted November 20, 2009.