  • Technical Report

Automating sequence-based detection and genotyping of SNPs from diploid samples

Abstract

The detection of sequence variation, for which DNA sequencing has emerged as the most sensitive and automated approach, forms the basis of all genetic analysis. Here we describe and illustrate an algorithm that accurately detects and genotypes SNPs from fluorescence-based sequence data. Because the algorithm focuses particularly on detecting SNPs through the identification of heterozygous individuals, it is especially well suited to the detection of SNPs in diploid samples obtained after DNA amplification. It is substantially more accurate than existing approaches and, notably, provides a useful quantitative measure of its confidence in each potential SNP detected and in each genotype called. Calls assigned the highest confidence are sufficiently reliable to remove the need for manual review in several contexts. For example, for sequence data from 47–90 individuals sequenced on both the forward and reverse strands, the highest-confidence calls from our algorithm detected 93% of all SNPs and 100% of high-frequency SNPs, with no false positive SNPs identified and 99.9% genotyping accuracy. This algorithm is implemented in a software package, PolyPhred version 5.0, which is freely available for academic use.
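The abstract reports performance for the highest-confidence calls as a detection rate and a false positive count. As a minimal sketch of how such figures can be computed from scored calls, the Python below evaluates a list of candidate SNPs at several score thresholds. The `Candidate` structure, the score scale, and the example data are all hypothetical; this is not PolyPhred's scoring procedure or its output format.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A candidate SNP site with a confidence score (hypothetical structure)."""
    position: int
    score: float       # higher means more confident; the scale is invented
    is_true_snp: bool  # truth label, assumed to come from manual review


def evaluate(candidates, threshold):
    """Return (missed_snp_rate, false_discovery_rate) at a score threshold.

    Only true SNPs present in the candidate list are counted, so true SNPs
    that were never proposed as candidates are outside this toy calculation.
    """
    called = [c for c in candidates if c.score >= threshold]
    true_total = sum(c.is_true_snp for c in candidates)
    true_called = sum(c.is_true_snp for c in called)
    false_called = len(called) - true_called
    missed_rate = (true_total - true_called) / true_total if true_total else 0.0
    fdr = false_called / len(called) if called else 0.0
    return missed_rate, fdr


# Invented example data: six candidate sites, four of them real SNPs.
candidates = [
    Candidate(101, 99.0, True),
    Candidate(245, 97.0, True),
    Candidate(310, 80.0, True),
    Candidate(422, 75.0, False),
    Candidate(518, 40.0, True),
    Candidate(600, 20.0, False),
]

for t in (95.0, 70.0, 10.0):
    missed, fdr = evaluate(candidates, t)
    print(f"threshold={t:5.1f}  missed={missed:.2f}  FDR={fdr:.2f}")
```

Raising the threshold trades missed SNPs for fewer false discoveries: in this toy data, the strictest threshold calls no false positives but misses half of the true SNPs.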

Figure 1: Sequence traces (chromatograms) for four individuals.
Figure 2: Removal of systematic variation in peak height improves discrimination between heterozygotes and homozygotes.
Figure 3: Missed SNP rate versus false discovery rate for different data sets.
Figure 4: Dependence of performance on sequence quality.
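The caption of Figure 2 refers to removing systematic variation in peak height. The paper's own procedure is not given in this text, so the following Python is only an illustration of the general idea, using one simple stand-in technique: centering each position's secondary-peak ratio on its median across individuals. The ratios, the cutoff, and the function names are invented for the example.

```python
from statistics import median

# Hypothetical secondary-to-primary peak ratios. Keys are individuals,
# list entries are sequence positions. All values are invented.
ratios = {
    "ind1": [0.05, 0.30, 0.10],
    "ind2": [0.06, 0.32, 0.55],
    "ind3": [0.04, 0.28, 0.12],
    "ind4": [0.05, 0.31, 0.11],
}


def center_by_position(table):
    """Subtract each position's median ratio across individuals.

    Median centering is an assumed stand-in for removing position-specific
    (systematic) variation; it is not the method described in the paper.
    """
    names = list(table)
    n_pos = len(next(iter(table.values())))
    baselines = [median(table[n][j] for n in names) for j in range(n_pos)]
    return {n: [table[n][j] - baselines[j] for j in range(n_pos)] for n in names}


def flag_heterozygotes(table, cutoff=0.2):
    """Return (individual, position index) pairs whose value exceeds cutoff."""
    return [(n, j) for n, row in table.items() for j, v in enumerate(row) if v > cutoff]


centered = center_by_position(ratios)
print("raw:     ", flag_heterozygotes(ratios))
print("centered:", flag_heterozygotes(centered))
```

In the raw ratios, the second position has a high secondary peak in every individual, so a fixed cutoff flags all four samples there. After centering, only the one sample whose ratio stands out from the others at its position is flagged.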

Acknowledgements

The authors thank past and present members of the Nickerson lab for compiling the databases that were used to develop, train and test our algorithm. This work was supported by US National Institutes of Health (NIH) grants (1RO1HG/LM-02585 to M.S., and ES-15478 and HL-66682 to D.A.N.). P.S. was supported by an NIH training grant (T32 HG00035-06).

Author information

Corresponding author

Correspondence to Matthew Stephens.

Ethics declarations

Competing interests

PolyPhred is freely available for academic purposes, but a licensing fee is charged for commercial use; this fee predominantly funds further software and methods development.

Supplementary information

Supplementary Fig. 1

Illustration of how our method removes systematic variation in secondary peak height to improve discrimination between heterozygotes and homozygotes. (PDF 95 kb)

Supplementary Fig. 2

Relationship between score assigned by our method to each genotype and genotyping agreement rate. (PDF 20 kb)

Supplementary Fig. 3

Illustration of tiled and double-coverage sequencing designs. (PDF 23 kb)

Supplementary Fig. 4

Relationship between rates of agreements between genotypes and the percentage of uncalled genotypes, as threshold for calling genotypes varies. (PDF 18 kb)

Supplementary Methods (PDF 83 kb)

About this article

Cite this article

Stephens, M., Sloan, J., Robertson, P. et al. Automating sequence-based detection and genotyping of SNPs from diploid samples. Nat Genet 38, 375–381 (2006). https://doi.org/10.1038/ng1746

  • DOI: https://doi.org/10.1038/ng1746
