  • Perspective

A common open representation of mass spectrometry data and its application to proteomics research

Abstract

A broad range of mass spectrometers are used in mass spectrometry (MS)-based proteomics research. Each type of instrument possesses a unique design, data system and performance specifications, resulting in strengths and weaknesses for different types of experiments. Unfortunately, the native binary data formats produced by each type of mass spectrometer also differ and are usually proprietary. The diverse, nontransparent nature of the data structure complicates the integration of new instruments into preexisting infrastructure, impedes the analysis, exchange, comparison and publication of results from different experiments and laboratories, and prevents the bioinformatics community from accessing data sets required for software development. Here, we introduce the 'mzXML' format, an open, generic XML (extensible markup language) representation of MS data. We have also developed an accompanying suite of supporting programs. We expect that this format will facilitate data management, interpretation and dissemination in proteomics research.
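To make the idea of an open XML representation of MS data concrete, the following is a minimal sketch of reading spectra from a simplified, mzXML-like document. The element and attribute names (`msRun`, `scan`, `num`, `msLevel`, `peaks`) and the peak encoding (base64 text holding big-endian 32-bit floats in m/z, intensity order) are assumptions made for illustration; they are not taken from this page, and a real mzXML file also carries XML namespaces, instrument metadata and an index that this sketch ignores.

```python
import base64
import struct
import xml.etree.ElementTree as ET


def encode_peaks(pairs):
    """Pack (m/z, intensity) pairs as big-endian 32-bit floats, then base64."""
    flat = [value for pair in pairs for value in pair]
    packed = struct.pack(f">{len(flat)}f", *flat)
    return base64.b64encode(packed).decode("ascii")


def decode_peaks(text):
    """Inverse of encode_peaks: base64 text back to (m/z, intensity) tuples."""
    raw = base64.b64decode(text)
    values = struct.unpack(f">{len(raw) // 4}f", raw)
    return list(zip(values[0::2], values[1::2]))


def read_scans(xml_text):
    """Return one dict per <scan> element of the simplified document."""
    root = ET.fromstring(xml_text)
    scans = []
    for scan in root.iter("scan"):
        scans.append({
            "num": int(scan.get("num")),
            "ms_level": int(scan.get("msLevel")),
            "peaks": decode_peaks(scan.findtext("peaks")),
        })
    return scans


# Hypothetical one-scan document built in memory for the example.
SAMPLE = (
    "<msRun><scan num='1' msLevel='1'><peaks>{}</peaks></scan></msRun>"
).format(encode_peaks([(100.0, 10.0), (200.5, 20.0)]))

if __name__ == "__main__":
    for scan in read_scans(SAMPLE):
        print(scan["num"], scan["ms_level"], scan["peaks"])
```

Because the document is plain XML, the only tooling needed here is a standard XML parser and a base64 decoder, which is the kind of accessibility the abstract argues proprietary binary formats lack.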

Figure 1: The mzXML file acts as a mediator, allowing multiple input formats to be subjected to a common data analysis pipeline.
Figure 2: Overview of the mzXML format.
Figure 3: Although most mass spectrometers are capable of exporting data in formats recognized by sequence search engines (e.g., .dta for SEQUEST, .mgf for Mascot and .pkl for ProteinLynx), other data analysis operations, such as peptide quantification after stable isotope labeling, are not supported by these formats.
Figure 4: Various methods for visualization of data contained in mzXML documents.
Figure 5: Role of the mzXML format in an analysis framework.
Figure 6: Standard versus mzXML database search results.
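The mediator role described in the captions of Figures 1 and 3 can be sketched as a single in-memory spectrum type with one exporter per search-engine format. This is an illustrative sketch, not the authors' converter code: the `Spectrum` class, the function names and the number formatting are hypothetical, and the `.mgf` and `.dta` layouts shown are the commonly used minimal forms (a `BEGIN IONS`/`END IONS` block for Mascot; a first line with the singly protonated mass and charge for SEQUEST).

```python
from dataclasses import dataclass
from typing import List, Tuple

PROTON_MASS = 1.007276  # mass of a proton in Da, used to compute (M+H)+


@dataclass
class Spectrum:
    """Hypothetical common representation of one MS/MS spectrum."""
    title: str
    precursor_mz: float
    charge: int
    peaks: List[Tuple[float, float]]  # (m/z, intensity) pairs


def peak_lines(spectrum):
    """Format each peak as 'm/z intensity' on its own line."""
    return [f"{mz:.4f} {intensity:.1f}" for mz, intensity in spectrum.peaks]


def to_mgf(spectrum):
    """Render the spectrum as a minimal Mascot generic format block."""
    lines = [
        "BEGIN IONS",
        f"TITLE={spectrum.title}",
        f"PEPMASS={spectrum.precursor_mz:.4f}",
        f"CHARGE={spectrum.charge}+",
    ]
    lines += peak_lines(spectrum)
    lines.append("END IONS")
    return "\n".join(lines) + "\n"


def to_dta(spectrum):
    """Render the spectrum as minimal SEQUEST .dta text."""
    # Convert precursor m/z at charge z to the singly protonated mass.
    mh = (spectrum.precursor_mz - PROTON_MASS) * spectrum.charge + PROTON_MASS
    lines = [f"{mh:.4f} {spectrum.charge}"] + peak_lines(spectrum)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    example = Spectrum("scan 1", 500.0, 2, [(100.0, 10.0), (200.5, 20.0)])
    print(to_mgf(example))
    print(to_dta(example))
```

Each exporter keeps only the peak list and precursor information, which illustrates the point of the Figure 3 caption: search-engine formats are sufficient for database searching but drop information that other analyses, such as quantification, would need from the fuller representation.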

Acknowledgements

This project was funded in part by federal funds from the National Heart, Lung, and Blood Institute, National Institutes of Health, under contract no. N01-HV-28179 and by grant no. 1R33CA93302 from the National Cancer Institute. The Institute for Systems Biology is supported by a generous gift from Merck and Co. We are grateful to SourceForge for hosting the project and Eugene Yi for providing the seven-protein mix data set. We would also like to acknowledge the following for endorsing the mzXML format: Philip C. Andrews, Tom Blackwell, Daniel Burns, Jayson Falkner, Panagiotis Papoulias, Abhik Shah, Peter Ulintz, Al Burlingame, Robert Chalkley, Karl Clauser, Bruno Domon, James Eddes, Robert Moritz, Daniel Figeys, Barry L. Karger, William Hancock, Tomas Rejtar, Peter James, Matthias Mann, Sanford Markey, Matthias Wilm, Ken Williams and Kratos Analytical Limited (a Shimadzu Group Company).

Author information

Corresponding author

Correspondence to Ruedi Aebersold.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Fig. 1

The PEDRo model and the mzXML format. (PDF 420 kb)

Supplementary Fig. 2

The mzXML index. (PDF 19 kb)

Supplementary Methods (PDF 11 kb)

Supplementary Notes

Quick introduction to XML. (PDF 21 kb)

About this article

Cite this article

Pedrioli, P., Eng, J., Hubley, R. et al. A common open representation of mass spectrometry data and its application to proteomics research. Nat Biotechnol 22, 1459–1466 (2004). https://doi.org/10.1038/nbt1031

  • DOI: https://doi.org/10.1038/nbt1031
