Review
Statistical data processing in clinical proteomics

https://doi.org/10.1016/j.jchromb.2007.10.042

Abstract

This review discusses data analysis strategies for the discovery of biomarkers in clinical proteomics. Proteomics studies produce large amounts of data, characterized by few samples on which many variables are measured. A wealth of classification methods exists for extracting information from the data. Feature selection plays an important role in reducing the dimensionality of the data prior to classification and in discovering biomarker leads. Which classification strategy works best is still an open question. Validation is a crucial step for biomarker leads towards clinical use. Here we only discuss statistical validation, recognizing that biological and clinical validation is of utmost importance. First, there is the need for validated model selection to develop a generalized classifier that predicts new samples correctly. A cross-validation loop that is wrapped around the model development procedure assesses the performance using unseen data. The significance of the model should be tested; we use permutations of the data for comparison with uninformative data. This procedure also tests the correctness of the performance validation. Preferably, a new set of samples is measured to test the classifier and rule out results specific to a machine, analyst, laboratory or the first set of samples. This is not yet standard practice. We present a modular framework that combines feature selection, classification, biomarker discovery and statistical validation; these data analysis aspects are all discussed in this review. The feature selection, classification and biomarker discovery modules can be incorporated or omitted according to the preference of the researcher. The validation modules, however, should not be optional. In each module, the researcher can select from a wide range of methods, since there is not one unique way that leads to the correct model and proper validation. We discuss many possibilities for feature selection, classification and biomarker discovery. For validation we advise a combination of cross-validation and permutation testing, a validation strategy supported in the literature.

Introduction

Modern developments in analytical techniques like mass spectrometry (MS) created the opportunity to measure protein concentrations on a large scale; this area of research is called proteomics. The hope is that proteomics studies can contribute to healthcare. In clinical proteomics thousands of proteins or peptides can be measured in a single experiment. This review describes how information is obtained from pre-processed clinical proteomics data and how to validate the information using statistical procedures. The clinical proteomics experiments that we discuss in this paper can be seen as a discovery tool for biomarkers. A possible workflow for biomarker discovery is given in Fig. 1. It starts with a biological question, which leads to a carefully designed experiment, sampling and measurements. Preprocessing of the data is necessary to remove instrumental noise and make the measurements of the samples comparable. A preliminary answer to the biological question is obtained in the three blocks that are encircled in Fig. 1: data processing, biomarker pattern, and statistical validation. After the discovery of statistically valid biomarker leads, external testing and biological validation will show whether they truly answer the biological question.

Biomarkers can be used to predict the state of a patient, in diagnosis, to monitor the response to treatment, and to determine the stage of a disease. For diagnosis, and in essentially the same way for the other goals, samples from cases and controls are measured. The measurements are usually stored in a data matrix and class labels are stored in a response vector. Data analysis tools try to find the differences in measurements that predict the state of a patient. This information is preferably contained in just a few proteins (biomarkers) that are indicative of the biological state. Alternatively, the interplay of many variables can provide the desired information. Results should be subjected to validation: statistical as well as biological. The statistical validation should investigate the performance of the biomarker, as well as the relevance of the results. The biological validation is concerned with the question whether the biomarkers are involved in processes that can be related to the disease. If the result of both validation processes is satisfactory, a putative biomarker is established. Many more steps have to be taken before this leads to an established biomarker [1].

MS is not the only technique used for proteomics investigations. Protein arrays and 2D gels also play an important role in the field [2]. However, when mining the literature on data analysis in clinical proteomics, most hits we encountered were on MS studies. Reviews on the application of MS in proteomics are available [3], [4]; the current review does not discuss the many types of MS experiments. We restrict ourselves mainly to data analysis in single MS experiments (such as liquid chromatography–MS, matrix assisted laser desorption/ionisation MS and surface enhanced laser desorption/ionisation) although our conclusions also hold for other types of (omics) experiments.

In single MS experiments many different issues play a role. Among these are experimental design, selection of patients, sample handling, preprocessing of the spectra and biological validation [5], [6], [7], [8], [9], [10], [11], [12]. In this review we are not taking up these issues but focus on classification methods for proteomics studies and the statistical validation tools that are used in combination with the classification methods.

Classification methods applied in proteomics are developed in different sciences, such as machine learning, chemometrics, data mining and statistics. A wide range of methods is available, with many different characteristics. We try to give an overview of the methods that are popular in proteomics.

Validation of classification methods is an important and still open issue, mainly because of the characteristics of a proteomics data set. Usually, a mass spectrum contains thousands of different mass/charge (m/z) ratios. The sample size, e.g. the number of patients, is relatively small. This results in a so-called high dimensionality small sample problem. This type of problem suffers from the curse of dimensionality [13], which means that the number of samples needed to accurately describe a (discrimination) problem increases exponentially with the number of dimensions (variables). In proteomics studies, the number of samples is usually low compared to the number of variables, due to the limited availability or the cost of measurements. This undersampling makes it possible to discover a seemingly discriminating pattern between two populations, even when these two populations are statistically not distinct. Working with high dimensional data can easily lead to overfitting: the derived model is specific to the training data and does not perform well on new samples.
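The overfitting risk described above can be illustrated with a small simulation. The following is a minimal sketch in Python and is not taken from the review: it draws pure noise (20 training samples and 2000 variables, both arbitrary assumptions), selects the single variable that best separates arbitrarily assigned class labels, and then applies that rule to fresh noise. The function names `midpoint_rule` and `overfitting_demo` are hypothetical.

```python
import random

def midpoint_rule(col, y):
    """Threshold halfway between the class means; returns (threshold, sign, accuracy)."""
    m0 = sum(v for v, lab in zip(col, y) if lab == 0) / y.count(0)
    m1 = sum(v for v, lab in zip(col, y) if lab == 1) / y.count(1)
    thr = (m0 + m1) / 2
    sign = 1 if m1 >= m0 else -1  # predict class 1 on the side of its own mean
    acc = sum((sign * (v - thr) > 0) == (lab == 1) for v, lab in zip(col, y)) / len(y)
    return thr, sign, acc

def overfitting_demo(n_train=20, n_test=200, n_vars=2000, seed=0):
    """Pick the best of n_vars noise variables, then test it on new noise."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n_train)]  # balanced labels, unrelated to the data
    best = None
    for _ in range(n_vars):
        col = [rng.gauss(0, 1) for _ in range(n_train)]  # one uninformative variable
        thr, sign, acc = midpoint_rule(col, y)
        if best is None or acc > best[2]:
            best = (thr, sign, acc)
    thr, sign, train_acc = best
    # New samples of the selected variable: still pure noise.
    new_vals = [rng.gauss(0, 1) for _ in range(n_test)]
    y_new = [i % 2 for i in range(n_test)]
    test_acc = sum((sign * (v - thr) > 0) == (lab == 1)
                   for v, lab in zip(new_vals, y_new)) / n_test
    return train_acc, test_acc

if __name__ == "__main__":
    train_acc, test_acc = overfitting_demo()
    print("training accuracy:", train_acc)
    print("accuracy on new samples:", test_acc)
```

Under these assumptions the training accuracy is typically well above chance while the accuracy on new samples stays near 0.5, although the data contain no class information at all.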

Literature provides several approaches to overcome these problems. One approach is to reduce the dimensionality of the data. This can be done before a classification is performed or it can be combined with a classifier. Other techniques to cope with high dimensional data are statistical validation strategies, such as cross-validation and permutation tests.

This review starts with an overview of the most frequently encountered methods for classification and biomarker discovery in clinical proteomics. We then present a framework into which most of the methods fit. Finally, we put forward a strategy for a thorough statistical assessment of the entire data analysis procedure.

Section snippets

Feature selection

Feature selection plays an important role in clinical data analysis for three reasons. First, using all features in forming the classification rule in general does not give the best performance. Increasing the number of features from zero enhances performance up to some point, after which adding more features leads to deteriorating performance, because many features are uninformative and they can conceal information in relevant features. This is called the peaking phenomenon [14], [15], [16]. The

Discriminant analysis

Discriminant analysis (DA) was first introduced by Fisher, who used it to discriminate between different Iris species [27]. In the feature space, a direction is sought that maximizes the differences between the classes with respect to the covariance within the control and case classes (Fig. 2). This direction, the discriminant vector, can be used to classify new samples. DA uses the covariance matrix to find the discriminant vector. Linear discriminant analysis (LDA) assumes the within-class
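The discriminant direction described above can be sketched for two variables. This is a minimal illustration, assuming the standard Fisher solution w = S_w^{-1}(m_1 - m_0), where S_w is the pooled within-class scatter matrix and m_0, m_1 are the class means; the function names and the toy data are hypothetical. With more variables than samples, as is usual in proteomics, S_w is singular and this direct inversion is not possible.

```python
def mean_vec(rows):
    """Mean of a list of 2D points."""
    n = len(rows)
    return [sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n]

def scatter(rows, m):
    """2x2 within-class scatter matrix (sum of outer products of deviations)."""
    sxx = sum((r[0] - m[0]) ** 2 for r in rows)
    syy = sum((r[1] - m[1]) ** 2 for r in rows)
    sxy = sum((r[0] - m[0]) * (r[1] - m[1]) for r in rows)
    return [[sxx, sxy], [sxy, syy]]

def fisher_direction(controls, cases):
    """Return the discriminant vector w and a threshold at the midpoint of the means."""
    m0, m1 = mean_vec(controls), mean_vec(cases)
    s0, s1 = scatter(controls, m0), scatter(cases, m1)
    # Pooled within-class scatter S_w = S_0 + S_1 (symmetric 2x2 matrix).
    a = s0[0][0] + s1[0][0]
    b = s0[0][1] + s1[0][1]
    d = s0[1][1] + s1[1][1]
    det = a * d - b * b  # zero when S_w is singular
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    # w = S_w^{-1} (m1 - m0), using the closed-form inverse of a 2x2 matrix.
    w = [(d * diff[0] - b * diff[1]) / det, (-b * diff[0] + a * diff[1]) / det]
    mid = [(m0[0] + m1[0]) / 2, (m0[1] + m1[1]) / 2]
    threshold = w[0] * mid[0] + w[1] * mid[1]
    return w, threshold

def classify(x, w, threshold):
    """Project x onto w; samples beyond the threshold are assigned to the cases."""
    return 1 if w[0] * x[0] + w[1] * x[1] > threshold else 0

# Hypothetical toy data: four controls and four cases.
controls = [(0, 0), (1, 0), (0, 1), (1, 1)]
cases = [(3, 3), (4, 3), (3, 4), (4, 4)]
w, threshold = fisher_direction(controls, cases)
```

Using the scatter matrix instead of the covariance matrix only rescales w, so the resulting classification is unchanged.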

Biomarker candidate selection

With biomarker candidate selection we refer to feature selection with the aim to discover which proteins are promising leads for biomarkers. We place this module after the classification methods, because the classification rules contain information about the contribution of each variable to the classification. This information reveals the proteins of interest, which may prove to be biomarkers. Two methods that determine the interesting variables directly are the classification tree [13], which

Comparison studies

Many more classification algorithms are available; the list of classifiers and variable selection methods we discuss is not exhaustive. The question arises which method is best suited for classification of proteomics data. It is hard to compare results from different studies because preprocessing, reporting of performance and validation schemes differ between them. There are some studies that describe performance of several classification methods

Statistical validation

The next step towards clinical utility is validation. First, the results of a preliminary clinical proteomics study should be subjected to thorough statistical assessment. Next, a new set of samples should be measured independently in time and/or place from the first data set to test the classifier. If the preliminary results warrant the investment, the following step would be identification of the relevant proteins to determine biological validity.
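The statistical assessment can combine cross-validation wrapped around the whole model development procedure with a permutation test. The sketch below is one possible illustration, not the authors' implementation. It assumes a simple mean-difference feature ranking and a nearest-centroid classifier as stand-ins for the feature selection and classification modules, and it uses synthetic data; all function names and parameter values are hypothetical.

```python
import random

def class_means(X, y, cols):
    """Per-class mean of each selected column; assumes both classes occur in y."""
    means = {}
    for label in (0, 1):
        rows = [x for x, lab in zip(X, y) if lab == label]
        means[label] = [sum(r[j] for r in rows) / len(rows) for j in cols]
    return means

def select_features(X, y, k):
    """Rank variables by absolute difference of class means; keep the top k."""
    cols = list(range(len(X[0])))
    m = class_means(X, y, cols)
    return sorted(cols, key=lambda j: abs(m[1][j] - m[0][j]), reverse=True)[:k]

def predict(x, means, cols):
    """Nearest-centroid rule on the selected columns."""
    dist = {lab: sum((x[j] - c) ** 2 for j, c in zip(cols, means[lab]))
            for lab in (0, 1)}
    return 0 if dist[0] <= dist[1] else 1

def cv_accuracy(X, y, n_folds=5, k=5, seed=0):
    """Cross-validation with feature selection repeated inside every fold."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    correct = 0
    for f in range(n_folds):
        test = set(idx[f::n_folds])
        X_tr = [X[i] for i in idx if i not in test]
        y_tr = [y[i] for i in idx if i not in test]
        cols = select_features(X_tr, y_tr, k)  # sees training samples only
        means = class_means(X_tr, y_tr, cols)
        correct += sum(predict(X[i], means, cols) == y[i] for i in test)
    return correct / len(y)

def permutation_p_value(X, y, n_perm=50, seed=1):
    """Compare the observed CV accuracy with accuracies on permuted labels."""
    rng = random.Random(seed)
    observed = cv_accuracy(X, y)
    hits = 0
    for _ in range(n_perm):
        y_perm = y[:]
        rng.shuffle(y_perm)  # destroys any relation between data and labels
        if cv_accuracy(X, y_perm) >= observed:
            hits += 1
    return observed, (1 + hits) / (1 + n_perm)

def make_data(n_per_class=10, n_vars=50, shift=3.0, seed=42):
    """Synthetic data: only the first three variables differ between classes."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0, 1) for _ in range(n_vars)]
            for j in range(3):
                row[j] += shift * label
            X.append(row)
            y.append(label)
    return X, y

if __name__ == "__main__":
    X, y = make_data()
    accuracy, p_value = permutation_p_value(X, y)
    print("cross-validated accuracy:", accuracy)
    print("permutation p-value:", p_value)
```

The key design choice is that feature selection is repeated inside every fold and for every permutation, so the held-out samples never influence the model being tested. Selecting features once on the full data set before cross-validation would give an optimistically biased performance estimate.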

In this section we describe two tools,

Proteomics data analysis: a framework

Data analysis methods extract information from the data to predict the class. As shown, there are many methods for feature selection, classification, biomarker candidate selection and statistical validation. It is possible to combine methods in different ways, leading to many data analysis approaches. We propose a modular data analysis framework (Fig. 6), in which most data analysis strategies fit. Some of the modules are optional, but validation is not! For each module the researcher can use

External test set

If there is only one data set available, a cross-validation approach makes efficient use of the data [97]. However, an external test set is always of added value [94]. An external data set obtained in a different way can show whether the model is too specific to the data set that was used to construct the classification rule. For example, the measurement could be performed on another instrument, by a different person, and the samples could have been obtained from a different population of

Conclusions

Proteomics research, despite the large effort in recent years, still faces many issues that are subject to debate. This review discussed some issues related to the analysis of proteomics data. Due to the complex nature and the high dimensionality of the data it generates, it is easy to find differences between groups. But these differences are possibly just chance results. The goal is to develop classifiers and/or biomarkers that perform well on new data. Furthermore, a proper estimate of the

Acknowledgement

We thank Daniel Vis for careful reading of this manuscript.

References (102)

  • M. Dijkstra et al., J. Chromatogr. B (2007)
  • L. Kanal et al., Pattern Recogn. (1971)
  • R. Wehrens et al., TrAC Trends Anal. Chem. (1998)
  • S. Smit et al., Anal. Chim. Acta (2007)
  • J. Gottfries et al., Chemom. Intell. Lab. Syst. (2004)
  • D. Agranoff et al., Lancet (2006)
  • S. Bhattacharyya et al., Neoplasia (2004)
  • M. Dettling et al., J. Multivar. Anal. (2004)
  • Q.H.C. Ru et al., Mol. Cell. Proteomics (2006)
  • J.C. Oates et al., Kidney Int. (2005)
  • J.H. Hong et al., Artif. Intell. Med. (2006)
  • G. Valentini et al., Neurocomputing (2004)
  • A. Bertoni et al., Neurocomputing (2005)
  • K.J. Kim et al., Neurocomputing (2006)
  • R. Breitling et al., FEBS Lett. (2004)
  • M. Wagner et al., BMC Bioinform. (2004)
  • K. Baumann, TrAC Trends Anal. Chem. (2003)
  • R. Kohavi et al., Artif. Intell. (1997)
  • R.G. Brereton, TrAC Trends Anal. Chem. (2006)
  • M.S. Pepe et al., J. Natl. Cancer Inst. (2001)
  • S. Hanash, Nature (2003)
  • R. Aebersold et al., Nature (2003)
  • B. Domon et al., Science (2006)
  • G.S. Omenn, Proteomics (2006)
  • J. Villanueva et al., J. Proteome Res. (2005)
  • R.A.R. Bowen et al., Clin. Chem. (2005)
  • A.J. Rai et al., Expert Rev. Proteomics (2006)
  • M. West-Nielsen et al., Anal. Chem. (2005)
  • A.E. Pelzer et al., BJU Int. (2007)
  • M. Hilario et al., Mass Spectrom. Rev. (2006)
  • T. Hastie et al., The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2001)
  • A. Choudhary et al., Bioinformatics (2006)
  • E.R. Dougherty et al., Curr. Genomics (2007)
  • I. Guyon et al., Mach. Learn. (2002)
  • M.K. Titulaer et al., BMC Bioinform. (2006)
  • B.L. Wu et al., Bioinformatics (2003)
  • I. Levner, BMC Bioinform. (2005)
  • D.I. Broadhurst et al., Metabolomics (2006)
  • Y. Benjamini et al., J.R. Statist. Soc. B (1995)
  • J.D. Storey, J.R. Statist. Soc. B (2002)
  • V.G. Tusher et al., Proc. Natl. Acad. Sci. U.S.A. (2001)
  • I.T. Jolliffe, Principal Component Analysis (2002)
  • R.A. Fisher, Ann. Eugen. (1936)
  • J.H. Friedman, J. Am. Stat. Assoc. (1989)
  • S. Dudoit et al., J. Am. Stat. Assoc. (2002)
  • R. Hoogerbrugge et al., Anal. Chem. (1983)
  • J. Ye et al., IEEE/ACM Trans. Comput. Biol. Bioinform. (2004)
  • R.H. Lilien et al., J. Comput. Biol. (2003)
  • S. Wold et al., SIAM J. Sci. Stat. Comput. (1984)
  • M. Barker et al., J. Chemom. (2003)

This paper is part of a Special Issue dedicated to the 50th anniversary of Journal of Chromatography.