Review
Statistical data processing in clinical proteomics☆
Introduction
Modern developments in analytical techniques such as mass spectrometry (MS) have created the opportunity to measure protein concentrations on a large scale; this area of research is called proteomics. The hope is that proteomics studies can contribute to healthcare. In clinical proteomics, thousands of proteins or peptides can be measured in a single experiment. This review describes how information is obtained from pre-processed clinical proteomics data and how that information can be validated using statistical procedures. The clinical proteomics experiments discussed in this paper can be seen as a discovery tool for biomarkers. A possible workflow for biomarker discovery is given in Fig. 1. It starts with a biological question, which leads to a carefully designed experiment, sampling and measurements. Preprocessing of the data is necessary to remove instrumental noise and to make the measurements of the samples comparable. A preliminary answer to the biological question is obtained in the three blocks that are encircled in Fig. 1: data processing, biomarker pattern, and statistical validation. After the discovery of statistically valid biomarker leads, external testing and biological validation will show whether they truly answer the biological question.
Biomarkers can be used to predict the state of a patient, in diagnosis, to monitor the response to treatment, and to determine the stage of a disease. For diagnosis, and in essentially the same way for the other goals, samples from cases and controls are measured. The measurements are usually stored in a data matrix and the class labels in a response vector. Data analysis tools try to find the differences in measurements that predict the state of a patient. This information is preferably contained in just a few proteins (biomarkers) that are indicative of the biological state. Alternatively, the interplay of many variables in the multivariate data can provide the desired information. Results should be subjected to validation, statistical as well as biological. The statistical validation should investigate the performance of the biomarker, as well as the relevance of the results. The biological validation is concerned with the question of whether the biomarkers are involved in processes that can be related to the disease. If the result of both validation processes is satisfactory, a putative biomarker is established. Many more steps have to be taken before this leads to an established biomarker [1].
MS is not the only technique used for proteomics investigations. Protein arrays and 2D gels also play an important role in the field [2]. However, when mining the literature on data analysis in clinical proteomics, most hits we encountered concerned MS studies. Reviews on the application of MS in proteomics are available [3], [4]; the current review does not discuss the many types of MS experiments. We restrict ourselves mainly to data analysis in single MS experiments (such as liquid chromatography–MS, matrix-assisted laser desorption/ionisation MS and surface-enhanced laser desorption/ionisation), although our conclusions also hold for other types of (omics) experiments.
In single MS experiments many different issues play a role. Among these are experimental design, selection of patients, sample handling, preprocessing of the spectra and biological validation [5], [6], [7], [8], [9], [10], [11], [12]. In this review we do not take up these issues but focus on classification methods for proteomics studies and on the statistical validation tools that are used in combination with those classification methods.
Classification methods applied in proteomics have been developed in different disciplines, such as machine learning, chemometrics, data mining and statistics. A wide range of methods is available, with many different characteristics. We try to give an overview of the methods that are popular in proteomics.
Validation of classification methods is an important and still open issue, mainly because of the characteristics of a proteomics data set. Usually, a mass spectrum contains thousands of different mass/charge (m/z) ratios. The sample size, e.g. the number of patients, is relatively small. This results in a so-called high-dimensionality, small-sample problem. This type of problem suffers from the curse of dimensionality [13], which means that the number of samples needed to accurately describe a (discrimination) problem increases exponentially with the number of dimensions (variables). In proteomics studies, the number of samples is usually low compared to the number of variables, due to the limited availability or the cost of measurements. This undersampling makes it possible to discover a discriminating pattern between two populations even when these two populations are statistically not distinct. Working with high-dimensional data can easily lead to overfitting: the derived model is specific to the training data and does not perform well on new samples.
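The undersampling effect can be illustrated with a small simulation. The sketch below is not taken from the reviewed literature; it assumes pure Gaussian noise as "spectra", 20 samples, 2000 variables, and a hypothetical one-variable threshold rule (`midpoint_rule`). Among that many noise variables, one that separates the training labels well can usually be found, while its accuracy on new samples is expected to stay near chance.

```python
import random

def midpoint_rule(values, labels):
    """Fit a one-variable rule: threshold halfway between the two class means."""
    cases = [v for v, y in zip(values, labels) if y == 1]
    controls = [v for v, y in zip(values, labels) if y == 0]
    mean_case = sum(cases) / len(cases)
    mean_control = sum(controls) / len(controls)
    threshold = (mean_case + mean_control) / 2
    case_is_high = mean_case > mean_control
    return threshold, case_is_high

def accuracy(values, labels, threshold, case_is_high):
    """Fraction of samples whose predicted class equals the true class."""
    predicted = [int((v > threshold) == case_is_high) for v in values]
    return sum(p == y for p, y in zip(predicted, labels)) / len(labels)

rng = random.Random(0)
n_samples, n_features = 20, 2000          # assumed sizes, few samples, many variables
labels = [1] * 10 + [0] * 10              # 10 cases, 10 controls

# columns[j] holds variable j for all samples; every value is pure noise.
columns = [[rng.gauss(0, 1) for _ in range(n_samples)] for _ in range(n_features)]

# Pick the variable whose rule looks best on the training samples.
best = max(
    range(n_features),
    key=lambda j: accuracy(columns[j], labels, *midpoint_rule(columns[j], labels)),
)
threshold, case_is_high = midpoint_rule(columns[best], labels)
train_acc = accuracy(columns[best], labels, threshold, case_is_high)

# New samples: only the selected variable is needed to apply the rule.
new_labels = [1] * 500 + [0] * 500
new_values = [rng.gauss(0, 1) for _ in new_labels]
test_acc = accuracy(new_values, new_labels, threshold, case_is_high)

print(f"apparent accuracy on training samples: {train_acc:.2f}")
print(f"accuracy on new samples:               {test_acc:.2f}")
```

The gap between the two printed numbers is the optimism that the validation strategies discussed in this review are meant to expose.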
The literature provides several approaches to overcome these problems. One approach is to reduce the dimensionality of the data. This can be done before a classification is performed, or it can be combined with a classifier. Other techniques to cope with high-dimensional data are statistical validation strategies, such as cross-validation and permutation tests.
This review starts with an overview of the most frequently encountered methods for classification and biomarker discovery in clinical proteomics. We then present a framework into which most of these methods fit. Finally, a strategy is put forward for a thorough statistical assessment of the entire data analysis procedure.
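As a rough illustration of how cross-validation and a permutation test can be combined, the sketch below computes a leave-one-out accuracy and compares it with the accuracies obtained after shuffling the class labels. The nearest-centroid classifier, the simulated two-variable data and all function names are assumptions made for this example, not the procedure of any cited study.

```python
import random

def loo_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier."""
    correct = 0
    for i in range(len(X)):
        # Build the class centroids without sample i.
        train = [(x, label) for j, (x, label) in enumerate(zip(X, y)) if j != i]
        centroids = {}
        for cls in (0, 1):
            members = [x for x, label in train if label == cls]
            centroids[cls] = [sum(col) / len(members) for col in zip(*members)]
        # Assign sample i to the closest centroid (squared Euclidean distance).
        dist = {
            cls: sum((a - b) ** 2 for a, b in zip(X[i], c))
            for cls, c in centroids.items()
        }
        correct += min(dist, key=dist.get) == y[i]
    return correct / len(X)

def permutation_p_value(X, y, n_permutations=200, seed=0):
    """Share of label shufflings that score at least as well as the true labels."""
    rng = random.Random(seed)
    observed = loo_accuracy(X, y)
    hits = 0
    for _ in range(n_permutations):
        shuffled = y[:]
        rng.shuffle(shuffled)
        # The complete cross-validation is repeated for every permutation.
        hits += loo_accuracy(X, shuffled) >= observed
    # Adding 1 counts the observed labelling itself and avoids a p-value of zero.
    return observed, (hits + 1) / (n_permutations + 1)

# Simulated data: 8 controls and 8 cases, cases shifted by 4 units in both variables.
rng = random.Random(1)
y = [0] * 8 + [1] * 8
X = [[rng.gauss(4.0 * label, 1.0), rng.gauss(4.0 * label, 1.0)] for label in y]

observed, p_value = permutation_p_value(X, y)
print(f"leave-one-out accuracy: {observed:.2f}, permutation p-value: {p_value:.3f}")
```

With labels that carry no information, the observed accuracy would be expected to fall inside the permutation distribution and the p-value would be large.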
Section snippets
Feature selection
Feature selection plays an important role in clinical data analysis for three reasons. First, using all features to form the classification rule generally does not give the best performance. Increasing the number of features from zero enhances performance up to some point, after which adding more features leads to deteriorating performance, because many features are uninformative and can conceal the information in relevant features. This is called the peaking phenomenon [14], [15], [16]. The
Discriminant analysis
Discriminant analysis (DA) was first introduced by Fisher, who used it to discriminate between different Iris species [27]. In the feature space, a direction is sought that maximizes the differences between the classes with respect to the covariance within the control and case classes (Fig. 2). This direction, the discriminant vector, can be used to classify new samples. DA uses the covariance matrix to find the discriminant vector. Linear discriminant analysis (LDA) assumes the within-class
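The search for a discriminant vector can be sketched for the two-class, two-variable case. The example below uses the standard Fisher solution, in which the direction is the inverse of the pooled within-class scatter matrix multiplied by the difference of the class means. The toy measurements and the function names are hypothetical and serve only to make the computation concrete.

```python
def mean_vector(rows):
    """Column-wise mean of a list of samples."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def scatter_2d(rows, mean):
    """2x2 scatter matrix: sum of outer products of the centred samples."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mean[0], r[1] - mean[1]]
        for a in range(2):
            for b in range(2):
                s[a][b] += d[a] * d[b]
    return s

def fisher_direction(controls, cases):
    """Return the discriminant vector w and a cutoff halfway between the class means."""
    m0, m1 = mean_vector(controls), mean_vector(cases)
    s0, s1 = scatter_2d(controls, m0), scatter_2d(cases, m1)
    # Pooled within-class scatter matrix.
    sw = [[s0[a][b] + s1[a][b] for b in range(2)] for a in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    # w = Sw^-1 (m1 - m0)
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    cutoff = sum(wi * (a + b) / 2 for wi, a, b in zip(w, m0, m1))
    return w, cutoff

def project(w, row):
    """Score of a sample along the discriminant vector."""
    return sum(wi * xi for wi, xi in zip(w, row))

# Hypothetical measurements of two variables for four controls and four cases.
controls = [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [2.0, 1.0]]
cases = [[5.0, 6.0], [6.0, 5.0], [7.0, 7.0], [6.0, 8.0]]

w, cutoff = fisher_direction(controls, cases)
new_sample = [5.5, 5.0]
predicted = "case" if project(w, new_sample) > cutoff else "control"
print("discriminant vector:", w, "prediction for new sample:", predicted)
```

With thousands of variables and few samples the within-class scatter matrix is singular and cannot be inverted directly, which is one reason dimension reduction or regularisation is combined with DA in practice.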
Biomarker candidate selection
With biomarker candidate selection we refer to feature selection with the aim to discover which proteins are promising leads for biomarkers. We place this module after the classification methods, because the classification rules contain information about the contribution of each variable to the classification. This information reveals the proteins of interest, which may prove to be biomarkers. Two methods that determine the interesting variables directly are the classification tree [13], which
Comparison studies
Many more classification algorithms are available; the list of classifiers and variable selection methods we discuss is not exhaustive. The question arises which method is best suited for classification of proteomics data. It is hard to compare results from different studies because preprocessing, reporting of performance and validation schemes are not the same. There are some studies that describe performance of several classification methods
Statistical validation
The next step towards clinical utility is validation. First, the results of a preliminary clinical proteomics study should be subjected to thorough statistical assessment. Next, a new set of samples should be measured independently in time and/or place from the first data set to test the classifier. If the preliminary results warrant the investment, the following step would be identification of the relevant proteins to determine biological validity.
In this section we describe two tools,
Proteomics data analysis: a framework
Data analysis methods extract information from the data to predict the class. As shown, there are many methods for feature selection, classification, biomarker candidate selection and statistical validation. It is possible to combine methods in different ways, leading to many data analysis approaches. We propose a modular data analysis framework (Fig. 6), in which most data analysis strategies fit. Some of the modules are optional, but validation is not! For each module the researcher can use
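A minimal sketch of such a modular set-up is given below. It is an illustration under stated assumptions rather than the framework of Fig. 6 itself: the three modules (`select_features`, `fit_centroids`, `cross_validate`), the nearest-centroid rule and the simulated data are all hypothetical choices. The point it tries to show is that the validation module wraps the other modules, so that feature selection is repeated inside every training fold and the left-out samples never influence it.

```python
import random

def select_features(X, y, k):
    """Module 1: keep the k variables with the largest class-mean difference."""
    def score(j):
        cases = [row[j] for row, label in zip(X, y) if label == 1]
        controls = [row[j] for row, label in zip(X, y) if label == 0]
        return abs(sum(cases) / len(cases) - sum(controls) / len(controls))
    return sorted(range(len(X[0])), key=score, reverse=True)[:k]

def fit_centroids(X, y, features):
    """Module 2: nearest-centroid classification rule on the selected variables."""
    centroids = {}
    for cls in (0, 1):
        rows = [row for row, label in zip(X, y) if label == cls]
        centroids[cls] = [sum(r[j] for r in rows) / len(rows) for j in features]
    return centroids

def predict(centroids, features, row):
    """Assign a sample to the class with the closest centroid."""
    dist = {
        cls: sum((row[j] - c) ** 2 for j, c in zip(features, centre))
        for cls, centre in centroids.items()
    }
    return min(dist, key=dist.get)

def cross_validate(X, y, k_features, n_folds=5):
    """Module 3: validation; selection and fitting are redone in every fold."""
    correct = 0
    for fold in range(n_folds):
        test_idx = [i for i in range(len(X)) if i % n_folds == fold]
        train_idx = [i for i in range(len(X)) if i % n_folds != fold]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        features = select_features(X_train, y_train, k_features)
        centroids = fit_centroids(X_train, y_train, features)
        correct += sum(predict(centroids, features, X[i]) == y[i] for i in test_idx)
    return correct / len(X)

# Simulated data: 20 samples, 200 variables, only the first 5 differ between classes.
rng = random.Random(2)
y = [i % 2 for i in range(20)]
X = [
    [rng.gauss(3.0 if (label == 1 and j < 5) else 0.0, 1.0) for j in range(200)]
    for label in y
]

cv_accuracy = cross_validate(X, y, k_features=5)
print(f"cross-validated accuracy with selection inside the folds: {cv_accuracy:.2f}")
```

Because each module is an ordinary function, a different selection method or classifier can be swapped in without touching the validation loop.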
External test set
If there is only one data set available, a cross-validation approach makes efficient use of the data [97]. However, an external test set is always of added value [94]. An external data set obtained in a different way can show whether the model is too specific for the data set that was used to construct the classification rule. For example, the measurement could be performed on another instrument, by a different person, and the samples could have been obtained from a different population of
Conclusions
Proteomics research, despite the large effort in recent years, still faces many issues that are subject to debate. This review discussed some issues related to the analysis of proteomics data. Due to the complex nature and the high dimensionality of the data it generates, it is easy to find differences between groups. But these differences are possibly just chance results. The goal is to develop classifiers and/or biomarkers that perform well on new data. Furthermore, a proper estimate of the
Acknowledgement
We thank Daniel Vis for careful reading of this manuscript.
References (102)
- et al., J. Chromatogr. B (2007)
- et al., Pattern Recogn. (1971)
- et al., TrAC, Trends Anal. Chem. (1998)
- et al., Anal. Chim. Acta (2007)
- et al., Chemom. Intell. Lab. Syst. (2004)
- et al., Lancet (2006)
- et al., Neoplasia (2004)
- et al., J. Multivar. Anal. (2004)
- et al., Mol. Cell. Proteomics (2006)
- et al., Kidney Int. (2005)
- Artif. Intell. Med.
- Neurocomputing
- Neurocomputing
- Neurocomputing
- FEBS Lett.
- BMC Bioinform.
- TrAC, Trends Anal. Chem.
- Artif. Intell.
- TrAC, Trends Anal. Chem.
- J. Natl. Cancer Inst.
- Nature
- Nature
- Science
- Proteomics
- J. Proteome Res.
- Clin. Chem.
- Expert Rev. Proteomics
- Anal. Chem.
- BJU Int.
- Mass Spectrom. Rev.
- The Elements of Statistical Learning: Data Mining, Inference, and Prediction
- Bioinformatics
- Curr. Genomics
- Mach. Learn.
- BMC Bioinform.
- Bioinformatics
- BMC Bioinform.
- Metabolomics
- Soc. B
- J.R. Statist. Soc. B
- Proc. Natl. Acad. Sci. U.S.A.
- Principal Component Analysis
- Ann. Eugen.
- J. Am. Stat. Assoc.
- J. Am. Stat. Assoc.
- Anal. Chem.
- IEEE/ACM Trans. Comput. Biol. Bioinform.
- J. Comput. Biol.
- SIAM J. Sci. Stat. Comput.
- J. Chemom.
Cited by (55)
- Circulating miRNA analysis for cancer diagnostics and therapy
  2020, Molecular Aspects of Medicine
  Citation Excerpt: Whereas the unsupervised methods (also called clustering techniques) can be applied directly on data measured on unknown samples, supervised methods require a training set of independently classified samples to be available. For comprehensive overview of multivariate methods for biomarker identification, we recommend recent reviews from the related field of proteomics (Robotti et al., 2014; Smit et al., 2008). Once candidate miRNA biomarkers are identified and a classifier developed, its performance can be evaluated by receiver operating characteristic (ROC) analysis (Lusted, 1971).
- Preliminary study on plasma proteins in pregnant and non-pregnant female dogs
  2017, Theriogenology
- Severity of thought disorder predicts psychosis in persons at clinical high-risk
  2015, Schizophrenia Research
- Analysis of splashing in basic oxygen furnace through systematic modelling
  2015, IFAC-PapersOnLine
- KNN classification - evaluated by repeated double cross validation: Recognition of minerals relevant for comet dust
  2014, Chemometrics and Intelligent Laboratory Systems
- Proteomics
  2014, Handbook of Pharmacogenomics and Stratified Medicine
☆ This paper is part of a Special Issue dedicated to the 50th anniversary of Journal of Chromatography.