Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models

Abstract

Metagenomics projects collect DNA from uncharacterized environments that may contain thousands of species per sample. One main challenge facing metagenomic analysis is phylogenetic classification of raw sequence reads into groups representing the same or similar taxa, a prerequisite for genome assembly and for analyzing the biological diversity of a sample. New sequencing technologies have made metagenomics easier, by making sequencing faster, and more difficult, by producing shorter reads than previous technologies. Classifying sequences from reads as short as 100 base pairs has until now been relatively inaccurate, requiring researchers to use older, long-read technologies. We present Phymm, a classifier for metagenomic data that has been trained on 539 complete, curated genomes and can accurately classify reads as short as 100 base pairs, a substantial improvement over previous composition-based classification methods. We also describe how combining Phymm with sequence alignment algorithms improves accuracy.
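To illustrate the general idea of composition-based classification, the sketch below trains one Markov model per reference genome and assigns a read to the genome whose model gives it the highest log-likelihood. It is a deliberately simplified stand-in, not the Phymm implementation: it uses a single fixed-order Markov chain with add-one smoothing rather than an interpolated Markov model, and the function names, the order `k=3`, and the two toy "genomes" are all assumptions made up for this example.

```python
import math
from collections import defaultdict


def train_markov_model(genome, k=3):
    """Count how often each base follows each k-mer context in the genome."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(k, len(genome)):
        counts[genome[i - k:i]][genome[i]] += 1
    return counts


def log_likelihood(read, counts, k=3, alphabet="ACGT"):
    """Log-probability of the read under the model, with add-one smoothing."""
    total = 0.0
    for i in range(k, len(read)):
        context_counts = counts.get(read[i - k:i], {})
        numerator = context_counts.get(read[i], 0) + 1
        denominator = sum(context_counts.values()) + len(alphabet)
        total += math.log(numerator / denominator)
    return total


def classify(read, models, k=3):
    """Return the name of the model that scores the read highest."""
    return max(models, key=lambda name: log_likelihood(read, models[name], k))


# Toy reference sequences (hypothetical, for illustration only).
genomes = {
    "at_rich": "ATTATAATATTAATATATTA" * 5,
    "gc_rich": "GCCGCGGCGCCGGCGCGCCG" * 5,
}
models = {name: train_markov_model(seq) for name, seq in genomes.items()}

print(classify("ATATTAATAT", models))
print(classify("GCGCCGGCGC", models))
```

An interpolated Markov model differs from this sketch in that it combines predictions from several context lengths, weighting each by how much training data supports it, which is one reason it can remain informative on short reads.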

Figure 1: Accuracy of Phymm, with species-level matches masked.
Figure 2: PhymmBL's phylum-level population characterization of the AMD data.
Figure 3: PhymmBL's species-level population characterization of the AMD data.

Acknowledgements

We thank A. Delcher for helpful discussions regarding IMM configuration. This work was supported in part by US National Institutes of Health grants R01-LM006845 and R01-GM083873.

Author information

Contributions

A.B. performed the experiments and subsequent analysis. A.B. and S.L.S. designed the experiments and wrote the paper.

Corresponding author

Correspondence to Arthur Brady.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–10 and Supplementary Tables 1–11 (PDF 702 kb)

Supplementary Software

Open-source installer package for Phymm/PhymmBL, including all algorithms used during setup and scoring. (ZIP 5454 kb)

About this article

Cite this article

Brady, A., Salzberg, S. Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models. Nat Methods 6, 673–676 (2009). https://doi.org/10.1038/nmeth.1358

  • DOI: https://doi.org/10.1038/nmeth.1358