Statistical data processing in clinical proteomics

doi:10.1016/j.jchromb.2007.10.042

Journal of Chromatography B

Volume 866, Issues 1–2, 15 April 2008, Pages 77-88

https://doi.org/10.1016/j.jchromb.2007.10.042

Abstract

This review discusses data analysis strategies for the discovery of biomarkers in clinical proteomics. Proteomics studies produce large amounts of data, characterized by few samples on which many variables are measured. A wealth of classification methods exists for extracting information from the data. Feature selection plays an important role in reducing the dimensionality of the data prior to classification and in discovering biomarker leads. The question of which classification strategy works best is as yet unanswered. Validation is a crucial step for biomarker leads on the way to clinical use. Here we discuss only statistical validation, while recognizing that biological and clinical validation are of utmost importance. First, there is the need for validated model selection to develop a generalized classifier that predicts new samples correctly. A cross-validation loop that is wrapped around the model development procedure assesses the performance using unseen data. The significance of the model should be tested; we use permutations of the data for comparison with uninformative data. This procedure also tests the correctness of the performance validation. Preferably, a new set of samples is measured to test the classifier and rule out results specific to a machine, analyst, laboratory or the first set of samples. This is not yet standard practice. We present a modular framework that combines feature selection, classification, biomarker discovery and statistical validation; these data analysis aspects are all discussed in this review. The feature selection, classification and biomarker discovery modules can be incorporated or omitted according to the preference of the researcher. The validation modules, however, should not be optional. In each module, the researcher can select from a wide range of methods, since there is no single way that leads to the correct model and proper validation. We discuss many possibilities for feature selection, classification and biomarker discovery. For validation we advise a combination of cross-validation and permutation testing, a validation strategy supported in the literature.

Introduction

Modern developments in analytical techniques like mass spectrometry (MS) created the opportunity to measure protein concentrations on a large scale; this area of research is called proteomics. The hope is that proteomics studies can contribute to healthcare. In clinical proteomics thousands of proteins or peptides can be measured in a single experiment. This review describes how information is obtained from pre-processed clinical proteomics data and how to validate the information using statistical procedures. The clinical proteomics experiments that we discuss in this paper can be seen as a discovery tool for biomarkers. A possible workflow for biomarker discovery is given in Fig. 1. It starts with a biological question, which leads to a carefully designed experiment, sampling and measurements. Preprocessing of the data is necessary to remove instrumental noise and make the measurements of the samples comparable. A preliminary answer to the biological question is obtained in the three blocks that are encircled in Fig. 1: data processing, biomarker pattern, and statistical validation. After the discovery of statistically valid biomarker leads, external testing and biological validation will show whether they truly answer the biological question.

Biomarkers can be used to predict the state of a patient, in diagnosis, to monitor the response to treatment, and to determine the stage of a disease. For diagnosis, and in essentially the same way for the other goals, samples from cases and controls are measured. The measurements are usually stored in a data matrix and class labels are stored in a response vector. Data analysis tools try to find the differences in measurements that predict the state of a patient. This information is preferably contained in just a few proteins (biomarkers) that are indicative of the biological state. Alternatively, the interplay of multivariate data can provide the desired information. Results should be subjected to validation: statistical as well as biological. The statistical validation should investigate the performance of the biomarker, as well as the relevance of the results. The biological validation is concerned with the question of whether the biomarkers are involved in processes that can be related to the disease. If the result of both validation processes is satisfactory, a putative biomarker is established. Many more steps have to be taken before this leads to an established biomarker [1].

MS is not the only technique used for proteomics investigations. Protein arrays and 2D gels also play an important role in the field [2]. However, when mining the literature on data analysis in clinical proteomics, most hits we encountered were on MS studies. Reviews on the application of MS in proteomics are available [3], [4]; the current review does not discuss the many types of MS experiments. We restrict ourselves mainly to data analysis in single MS experiments (such as liquid chromatography–MS, matrix assisted laser desorption/ionisation MS and surface enhanced laser desorption/ionisation) although our conclusions also hold for other types of (omics) experiments.

In single MS experiments many different issues play a role. Among these are experimental design, selection of patients, sample handling, preprocessing of the spectra and biological validation [5], [6], [7], [8], [9], [10], [11], [12]. In this review we do not take up these issues but focus on classification methods for proteomics studies and the statistical validation tools that are used in combination with the classification methods.

Classification methods applied in proteomics were developed in different disciplines, such as machine learning, chemometrics, data mining and statistics. A wide range of methods is available, with many different characteristics. We try to give an overview of the methods that are popular in proteomics.

Validation of classification methods is an important and still open issue, mainly because of the characteristics of a proteomics data set. Usually, a mass spectrum contains thousands of different mass/charge (m/z) ratios. The sample size, e.g. the number of patients, is relatively small. This results in a so-called high-dimensionality, small-sample problem. This type of problem suffers from the curse of dimensionality [13], which means that the number of samples needed to accurately describe a (discrimination) problem increases exponentially with the number of dimensions (variables). In proteomics studies, the number of samples is usually low compared to the number of variables, due to the limited availability or the cost of measurements. This undersampling leads to the possibility of discovering a discriminating pattern between two populations, even when these two populations are statistically not distinct. Working with high-dimensional data can easily lead to overfitting: the derived model is specific to the training data and does not perform well on new samples.
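The effect of undersampling can be illustrated with a small simulation. The sketch below is not taken from the review: it assumes purely random, uninformative data, hypothetical sizes (20 training samples, 1000 variables) and a deliberately simple one-variable threshold rule. Although both classes are drawn from the same distribution, an exhaustive search typically finds a variable that separates the training samples well, while the same rule performs close to chance on new samples.

```python
import random

def make_noise_data(n_samples, n_features, rng):
    """Gaussian noise with balanced class labels that carry no information."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]
    return X, y

def threshold_accuracy(values, labels, threshold, sign):
    """Accuracy of the rule: predict class 1 when sign * (value - threshold) > 0."""
    hits = sum(int(sign * (v - threshold) > 0) == c for v, c in zip(values, labels))
    return hits / len(labels)

def best_single_feature(X, y):
    """Search every feature, threshold and direction for the best training accuracy."""
    best = (0.0, None, None, None)  # accuracy, feature index, threshold, sign
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        for threshold in column:
            for sign in (1, -1):
                acc = threshold_accuracy(column, y, threshold, sign)
                if acc > best[0]:
                    best = (acc, j, threshold, sign)
    return best

rng = random.Random(0)
X_train, y_train = make_noise_data(20, 1000, rng)   # hypothetical "study"
X_new, y_new = make_noise_data(200, 1000, rng)      # hypothetical new samples

train_acc, j, threshold, sign = best_single_feature(X_train, y_train)
new_acc = threshold_accuracy([row[j] for row in X_new], y_new, threshold, sign)
print(f"training accuracy: {train_acc:.2f}, accuracy on new samples: {new_acc:.2f}")
```

The gap between the two printed numbers is the overfitting described above: the apparent pattern exists only in the training samples.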

The literature provides several approaches to overcome these problems. One approach is to reduce the dimensionality of the data. This can be done before a classification is performed or it can be combined with a classifier. Other techniques to cope with high-dimensional data are statistical validation strategies, such as cross-validation and permutation tests.
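As an illustration of how dimensionality reduction and cross-validation can be combined, the following sketch repeats the feature selection inside every cross-validation fold, so that the held-out samples influence neither the selection nor the classifier. It is a minimal example under stated assumptions: the filter score, the nearest-centroid classifier, the simulated data and all function names are hypothetical choices made here, not methods prescribed by the review.

```python
import random
from statistics import mean, pstdev

def select_features(X, y, k):
    """Rank features by absolute class-mean difference scaled by spread; keep the top k."""
    scores = []
    for j in range(len(X[0])):
        col0 = [row[j] for row, c in zip(X, y) if c == 0]
        col1 = [row[j] for row, c in zip(X, y) if c == 1]
        spread = pstdev(col0) + pstdev(col1) + 1e-12
        scores.append((abs(mean(col1) - mean(col0)) / spread, j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]

def fit_centroids(X, y, features):
    """Nearest-centroid classifier: store the class means of the selected features."""
    return {c: [mean(row[j] for row, lab in zip(X, y) if lab == c) for j in features]
            for c in (0, 1)}

def predict(centroids, row, features):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    dist = {c: sum((row[j] - m) ** 2 for j, m in zip(features, centre))
            for c, centre in centroids.items()}
    return min(dist, key=dist.get)

def cross_validated_accuracy(X, y, k_features, n_folds, rng):
    """Feature selection and training are repeated inside every fold."""
    order = list(range(len(y)))
    rng.shuffle(order)
    folds = [order[i::n_folds] for i in range(n_folds)]
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in order if i not in held_out]
        X_tr, y_tr = [X[i] for i in train_idx], [y[i] for i in train_idx]
        features = select_features(X_tr, y_tr, k_features)  # training samples only
        centroids = fit_centroids(X_tr, y_tr, features)
        correct += sum(predict(centroids, X[i], features) == y[i] for i in test_idx)
    return correct / len(y)

# Simulated data: 40 samples, 200 variables, only the first 5 differ between classes.
rng = random.Random(1)
y = [i % 2 for i in range(40)]
X = [[rng.gauss(2.0 * c if j < 5 else 0.0, 1.0) for j in range(200)] for c in y]
acc = cross_validated_accuracy(X, y, k_features=5, n_folds=5, rng=rng)
print(f"cross-validated accuracy: {acc:.2f}")
```

Selecting the features once on all samples and cross-validating only the classifier would let information from the held-out samples leak into the model, which tends to give an optimistic performance estimate.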

This review starts with an overview of the most frequently encountered methods for classification and biomarker discovery in clinical proteomics. We then present a framework into which most of these methods fit. Finally, a strategy is put forward for a thorough statistical assessment of the entire data analysis procedure.

Section snippets

Feature selection

Feature selection plays an important role in clinical data analysis for three reasons. First, using all features in forming the classification rule in general does not give the best performance. Increasing the number of features from zero enhances performance up to some point, after which adding more features leads to deteriorating performance, because many features are uninformative and they can conceal the information in relevant features. This is called the peaking phenomenon [14], [15], [16]. The

Discriminant analysis

Discriminant analysis (DA) was first introduced by Fisher, who used it to discriminate between different Iris species [27]. In the feature space, a direction is sought that maximizes the differences between the classes with respect to the covariance within the control and case classes (Fig. 2). This direction, the discriminant vector, can be used to classify new samples. DA uses the covariance matrix to find the discriminant vector. Linear discriminant analysis (LDA) assumes the within-class
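The idea of a discriminant vector can be made concrete for two classes and two features. The sketch below assumes the usual two-class Fisher formulation with a pooled within-class scatter matrix, w = S_w^-1 (m_case - m_control), and places the decision threshold at the midpoint between the projected class means. The data values and function names are invented for illustration.

```python
def class_mean(rows):
    """Mean of a list of two-dimensional samples."""
    n = len(rows)
    return [sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n]

def scatter(rows, m):
    """2x2 within-class scatter matrix around the class mean m."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - m[0], r[1] - m[1]]
        for a in range(2):
            for b in range(2):
                s[a][b] += d[a] * d[b]
    return s

def fisher_discriminant(controls, cases):
    """Return (w, threshold) with w = S_w^-1 (m_case - m_control)."""
    m0, m1 = class_mean(controls), class_mean(cases)
    s0, s1 = scatter(controls, m0), scatter(cases, m1)
    sw = [[s0[a][b] + s1[a][b] for b in range(2)] for a in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det], [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    mid = [(m0[0] + m1[0]) / 2, (m0[1] + m1[1]) / 2]
    return w, w[0] * mid[0] + w[1] * mid[1]

def classify(sample, w, threshold):
    """Project the sample on w; 1 = case, 0 = control."""
    return int(w[0] * sample[0] + w[1] * sample[1] > threshold)

# Invented data: the two features are strongly correlated within each class,
# so neither feature separates the classes on its own.
controls = [[1.0, 2.0], [2.0, 3.1], [3.0, 3.9], [4.0, 5.2]]
cases = [[1.0, 3.4], [2.0, 4.6], [3.0, 5.5], [4.0, 6.4]]
w, threshold = fisher_discriminant(controls, cases)
print("discriminant vector:", w)
print("new sample [2.5, 5.0] classified as:", classify([2.5, 5.0], w, threshold))
```

In this toy data the second feature overlaps between the classes, yet the projection on w separates them because the within-class covariance is taken into account. With thousands of variables and few samples the scatter matrix is singular and cannot be inverted in this way, which is one reason dimensionality reduction or regularization is needed in that setting.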

Biomarker candidate selection

With biomarker candidate selection we refer to feature selection with the aim to discover which proteins are promising leads for biomarkers. We place this module after the classification methods, because the classification rules contain information about the contribution of each variable to the classification. This information reveals the proteins of interest, which may prove to be biomarkers. Two methods that determine the interesting variables directly are the classification tree [13], which

Comparison studies

Many more classification algorithms are available; the list of classifiers and variable selection methods we discuss is not exhaustive. The question arises which method is best suited for classification of proteomics data. It is hard to compare results from different studies because conditions vary. This is due to the fact that preprocessing, reporting of performance and validation schemes are not the same. There are some studies that describe performance of several classification methods

Statistical validation

The next step towards clinical utility is validation. First, the results of a preliminary clinical proteomics study should be subjected to thorough statistical assessment. Next, a new set of samples should be measured independently in time and/or place from the first data set to test the classifier. If the preliminary results warrant the investment, the following step would be identification of the relevant proteins to determine biological validity.
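One form such a statistical assessment can take is a permutation test, in which the class labels are shuffled so that the data become uninformative and the complete analysis is repeated on each shuffled copy. The sketch below is a minimal illustration under stated assumptions: the leave-one-out nearest-centroid score is only a stand-in for whatever validated performance measure a study uses, and the data, the number of permutations and all names are hypothetical.

```python
import random

def loo_centroid_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier (illustrative stand-in)."""
    correct = 0
    for i, (row, label) in enumerate(zip(X, y)):
        dist = {}
        for c in set(y):
            members = [X[k] for k in range(len(y)) if k != i and y[k] == c]
            if not members:
                continue
            centre = [sum(col) / len(members) for col in zip(*members)]
            dist[c] = sum((a - b) ** 2 for a, b in zip(row, centre))
        correct += min(dist, key=dist.get) == label
    return correct / len(y)

def permutation_p_value(X, y, score_fn, n_permutations, rng):
    """Share of label permutations that score at least as well as the real labels."""
    observed = score_fn(X, y)
    at_least_as_good = 0
    for _ in range(n_permutations):
        shuffled = y[:]
        rng.shuffle(shuffled)          # break any link between data and labels
        if score_fn(X, shuffled) >= observed:
            at_least_as_good += 1
    # The +1 terms count the observed labelling as one of the permutations.
    return observed, (at_least_as_good + 1) / (n_permutations + 1)

# Simulated data: 30 samples, 10 variables, the first 3 differ between classes.
rng = random.Random(2)
y = [i % 2 for i in range(30)]
X = [[rng.gauss(2.0 * c if j < 3 else 0.0, 1.0) for j in range(10)] for c in y]
observed, p = permutation_p_value(X, y, loo_centroid_accuracy, 200, rng)
print(f"observed accuracy: {observed:.2f}, permutation p-value: {p:.3f}")
```

If the score function itself leaks information, for example through feature selection performed outside the cross-validation, the permuted labels will also score well, so the same test gives a check on the correctness of the performance validation.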

In this section we describe two tools,

Proteomics data analysis: a framework

Data analysis methods extract information from the data to predict the class. As shown, there are many methods for feature selection, classification, biomarker candidate selection and statistical validation. It is possible to combine methods in different ways, leading to many data analysis approaches. We propose a modular data analysis framework (Fig. 6), in which most data analysis strategies fit. Some of the modules are optional, but validation is not! For each module the researcher can use

External test set

If only one data set is available, a cross-validation approach makes efficient use of the data [97]. However, an external test set is always of added value [94]. An external data set obtained in a different way can show whether the model is too specific for the data set that was used to construct the classification rule. For example, the measurement could be performed on another instrument, by a different person, and the samples could have been obtained from a different population of

Conclusions

Proteomics research, despite the large effort in recent years, still has many issues that are subject to debate. This review discussed some issues related to the analysis of proteomics data. Due to the complex nature and the high dimensionality of the data it generates, it is easy to find differences between groups, but these differences are possibly just chance results. The goal is to develop classifiers and/or biomarkers that perform well on new data. Furthermore, a proper estimate of the

Acknowledgement

We thank Daniel Vis for careful reading of this manuscript.

References (102)

M. Dijkstra et al., J. Chromatogr. B (2007)
L. Kanal et al., Pattern Recogn. (1971)
R. Wehrens et al., TrAC Trends Anal. Chem. (1998)
S. Smit et al., Anal. Chim. Acta (2007)
J. Gottfries et al., Chemom. Intell. Lab. Syst. (2004)
D. Agranoff et al., Lancet (2006)
S. Bhattacharyya et al., Neoplasia (2004)
M. Dettling et al., J. Multivar. Anal. (2004)
Q.H.C. Ru et al., Mol. Cell. Proteomics (2006)
J.C. Oates et al., Kidney Int. (2005)
J.H. Hong et al., Artif. Intell. Med. (2006)
G. Valentini et al., Neurocomputing (2004)
A. Bertoni et al., Neurocomputing (2005)
K.J. Kim et al., Neurocomputing (2006)
R. Breitling et al., FEBS Lett. (2004)
M. Wagner et al., BMC Bioinform. (2004)
K. Baumann, TrAC Trends Anal. Chem. (2003)
R. Kohavi et al., Artif. Intell. (1997)
R.G. Brereton, TrAC Trends Anal. Chem. (2006)
M.S. Pepe et al., J. Natl. Cancer Inst. (2001)
S. Hanash, Nature (2003)
R. Aebersold et al., Nature (2003)
B. Domon et al., Science (2006)
G.S. Omenn, Proteomics (2006)
J. Villanueva et al., J. Proteome Res. (2005)
R.A.R. Bowen et al., Clin. Chem. (2005)
A.J. Rai et al., Expert Rev. Proteomics (2006)
M. West-Nielsen et al., Anal. Chem. (2005)
A.E. Pelzer et al., BJU Int. (2007)
M. Hilario et al., Mass Spectrom. Rev. (2006)
T. Hastie et al., The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2001)
A. Choudhary et al., Bioinformatics (2006)
E.R. Dougherty et al., Curr. Genomics (2007)
I. Guyon et al., Mach. Learn. (2002)
M.K. Titulaer et al., BMC Bioinform. (2006)
B.L. Wu et al., Bioinformatics (2003)
I. Levner, BMC Bioinform. (2005)
D.I. Broadhurst et al., Metabolomics (2006)
Y. Benjamini et al., J.R. Statist. Soc. B (1995)
J.D. Storey, J.R. Statist. Soc. B (2002)
V.G. Tusher et al., Proc. Natl. Acad. Sci. U.S.A. (2001)
I.T. Jolliffe, Principal Component Analysis (2002)
R.A. Fisher, Ann. Eugen. (1936)
J.H. Friedman, J. Am. Stat. Assoc. (1989)
S. Dudoit et al., J. Am. Stat. Assoc. (2002)
R. Hoogerbrugge et al., Anal. Chem. (1983)
J. Ye et al., IEEE/ACM Trans. Comput. Biol. Bioinform. (2004)
R.H. Lilien et al., J. Comput. Biol. (2003)
S. Wold et al., SIAM J. Sci. Stat. Comput. (1984)
M. Barker et al., J. Chemom. (2003)

This paper is part of a Special Issue dedicated to the 50th anniversary of Journal of Chromatography.
