Gene selection from microarray data for cancer classification—a machine learning approach
Introduction
Accurate cancer diagnosis is vital for the successful application of specific therapies. Although cancer classification has improved over the last decade, there is still a need for a fully automated and less subjective method for cancer diagnosis. Recent studies demonstrated that DNA microarrays could provide useful information for cancer classification at the gene expression level due to their ability to measure the abundance of messenger ribonucleic acid (mRNA) transcripts for thousands of genes simultaneously.
Several machine learning algorithms have already been applied to classifying tumors using microarray data. Voting machines and self-organising maps (SOM) were used to analyse acute leukemia (Golub et al., 1999). Support vector machines (SVMs) were applied to multi-class cancer diagnosis by Ramaswamy et al. (2001). Hierarchical clustering was used to analyse colon tumors (Alon et al., 1999). The best classification results are reported by Li et al. (2003) and Antonov et al. (2004): Li et al. employed a rule discovery method, and Antonov et al. used maximal margin linear programming (MAMA).
Given the nature of cancer microarray data, which usually consists of a few hundred samples with thousands of genes as features, the analysis has to be carried out carefully. Work in such a high dimensional space is extremely difficult if not impossible. One straightforward approach to selecting relevant genes is to apply standard parametric tests such as the t-test (Thomas et al., 2001; Tsai et al., 2003) or non-parametric tests such as the Wilcoxon score test (Thomas et al., 2001; Antoniadis et al., 2003). Wilks’s Lambda score was proposed by Hwang et al. (2002) to assess the discriminatory power of individual genes. A new procedure (Antonov et al., 2004) was designed to detect groups of genes that are strongly associated with a particular cancer type.
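As a rough illustration of the t-test filter mentioned above, the following sketch ranks genes by the absolute value of Welch's t statistic between two classes. It is a minimal example under stated assumptions, not the procedure of the cited papers: the function names (welch_t, rank_genes) and the toy expression values are invented for illustration, and a real analysis would also need p-values and a correction for multiple testing.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def rank_genes(expr, labels):
    """Return gene indices sorted by decreasing |t| between class 0 and class 1.

    expr   -- list of samples, each a list of expression values (one per gene)
    labels -- list of 0/1 class labels, one per sample
    """
    n_genes = len(expr[0])
    scores = []
    for g in range(n_genes):
        a = [row[g] for row, y in zip(expr, labels) if y == 0]
        b = [row[g] for row, y in zip(expr, labels) if y == 1]
        scores.append((abs(welch_t(a, b)), g))
    return [g for _, g in sorted(scores, reverse=True)]


# Toy data (hypothetical values): 6 samples, 3 genes.
toy_expr = [
    [1.0, 5.1, 3.0],
    [1.2, 4.9, 2.0],
    [0.9, 5.0, 4.0],
    [3.1, 5.2, 3.5],
    [2.9, 4.8, 3.0],
    [3.0, 5.0, 4.0],
]
toy_labels = [0, 0, 0, 1, 1, 1]

# Gene 0 separates the classes clearly, gene 2 weakly, gene 1 not at all.
ranking = rank_genes(toy_expr, toy_labels)
```

Each gene is scored independently of the others, which is what makes this kind of filter cheap but blind to redundancy between genes.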
In this paper we consider two general approaches to feature subset selection for gene selection: wrappers and filters. The two differ in how they evaluate feature subsets. Filter approaches remove irrelevant features according to general characteristics of the data. Wrapper approaches, by contrast, apply machine learning algorithms to feature subsets and use cross-validation to score each subset. Most methods of gene selection for microarray data analysis focus on filter approaches, although there are a few publications on applying wrapper approaches (Inza et al., 2004; Xiong et al., 2001; Xing et al., 2001). Nevertheless, in theory, wrappers should provide more accurate classification results than filters (Langley, 1994). Wrappers use classifiers to estimate the usefulness of feature subsets. Such “tailor-made” feature subsets should give the corresponding classifiers better classification accuracy, since the features are selected according to their contribution to that accuracy. The disadvantage of the wrapper approach is its computational cost when combined with sophisticated algorithms such as support vector machines.
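The wrapper idea described above can be sketched as a greedy forward search that scores each candidate gene subset by leave-one-out cross-validation. This is only an illustrative sketch: the nearest-centroid classifier, the greedy search strategy, the function names and the toy data are all assumptions chosen to keep the example short, not the classifiers or search used in the paper.

```python
def nearest_centroid_predict(train_x, train_y, x):
    """Predict the label whose class centroid is closest to x (squared Euclidean)."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        members = [r for r, y in zip(train_x, train_y) if y == label]
        centroid = [sum(col) / len(members) for col in zip(*members)]
        dist = sum((xi - ci) ** 2 for xi, ci in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loocv_accuracy(expr, labels, subset):
    """Leave-one-out cross-validation accuracy using only the genes in subset."""
    data = [[row[g] for g in subset] for row in expr]
    hits = 0
    for i in range(len(data)):
        train_x = data[:i] + data[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        hits += nearest_centroid_predict(train_x, train_y, data[i]) == labels[i]
    return hits / len(data)


def forward_wrapper(expr, labels, max_genes):
    """Greedy forward selection: keep adding the gene that most improves accuracy."""
    selected, best_acc = [], 0.0
    while len(selected) < max_genes:
        candidates = [g for g in range(len(expr[0])) if g not in selected]
        if not candidates:
            break
        # Ties are broken in favour of the lowest gene index (hence -g).
        acc, neg_g = max((loocv_accuracy(expr, labels, selected + [g]), -g)
                         for g in candidates)
        if acc <= best_acc:
            break  # no candidate improves the cross-validated accuracy
        selected.append(-neg_g)
        best_acc = acc
    return selected, best_acc


# Toy data (hypothetical values): 6 samples, 3 genes; only gene 0 is informative.
wrap_expr = [
    [1.0, 5.1, 3.0],
    [1.2, 4.9, 2.0],
    [0.9, 5.0, 4.0],
    [3.1, 5.2, 3.5],
    [2.9, 4.8, 3.0],
    [3.0, 5.0, 4.0],
]
wrap_labels = [0, 0, 0, 1, 1, 1]
wrap_selected, wrap_acc = forward_wrapper(wrap_expr, wrap_labels, max_genes=2)
```

Every candidate subset triggers a full cross-validation run, which is where the computational cost noted above comes from once the classifier is expensive to train.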
As a filter approach, correlation-based feature selection (CFS) was proposed by Hall (1999). The rationale behind this algorithm is “a good feature subset is one that contains features highly correlated with the class, yet uncorrelated with each other.” Hall (1999) showed that CFS gives results comparable to the wrapper while executing many times faster. It will be shown later in this paper that combining CFS with decision trees, the naïve Bayes algorithm and SVM provides classification accuracy on cancer microarray data that is similar to or better than published results. The rest of this paper is organised as follows. We begin with a brief introduction to feature subset selection, followed by a description of feature wrappers, filters and CFS, which is essentially a filter algorithm. We discuss the advantages and disadvantages of using wrappers and filters to select feature subsets. Thereafter, we present the experimental results on acute leukemia and lymphoma microarray data. The last section discusses the results and concludes this paper.
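The CFS rationale quoted above is commonly expressed as a merit score for a subset of k features, merit = k * r_cf / sqrt(k + k * (k - 1) * r_ff), where r_cf is the mean feature-class correlation and r_ff is the mean feature-feature correlation. The sketch below evaluates that score on toy data. It uses absolute Pearson correlation purely as a stand-in, which may differ from the correlation measure used in the paper, and the function names and data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def cfs_merit(features, target):
    """CFS-style merit of a feature subset.

    features -- list of feature columns (each a list of values, one per sample)
    target   -- list of numeric class labels, one per sample
    """
    k = len(features)
    r_cf = sum(abs(pearson(f, target)) for f in features) / k
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    r_ff = (sum(abs(pearson(features[i], features[j])) for i, j in pairs) / len(pairs)
            if pairs else 0.0)
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


# Toy data (hypothetical): two genes, each agreeing with the class on 6 of 8
# samples, but making their errors on different samples.
cls = [0, 0, 0, 0, 1, 1, 1, 1]
gene_a = [0, 0, 0, 1, 1, 1, 1, 0]
gene_b = [0, 0, 1, 0, 1, 1, 0, 1]

merit_single = cfs_merit([gene_a], cls)             # one relevant gene
merit_redundant = cfs_merit([gene_a, gene_a], cls)  # adding a duplicate gene
merit_pair = cfs_merit([gene_a, gene_b], cls)       # adding a complementary gene
```

In this toy case the duplicated gene leaves the merit unchanged, while the complementary gene raises it, which matches the stated preference for features that correlate with the class but not with each other.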
Feature subset selection
We now define the basic notions used in the paper. Given a microarray cancer data set, which contains n samples from different cancer types or subtypes, we have to build a mathematical model which can map the samples to their classes. Each sample has m genes as its features. The assumption here is that not all genes measured by a microarray are related to cancer classification. Some genes are irrelevant and some are redundant from the machine learning point of view. It is well-known that the
Analysis of acute leukemia data
The acute leukemia data of Golub et al. (1999) consists of samples from two different types of acute leukemia, acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). The training data set has 38 bone marrow samples (27 ALL and 11 AML). Each sample has expression patterns of 7129 genes measured by the Affymetrix oligonucleotide microarray. The test data set consists of 24 bone marrow and 10 peripheral blood samples (20 ALL and 14 AML).
Feature-ranking filters provide a natural way
Discussion
We have shown in this paper that feature subset selection algorithms, namely wrappers, filters and CFS, can be very useful in extracting relevant information in microarray data analysis. Wrapper approaches can choose the best genes for building classifiers, while filters can provide a useful overview by ranking the genes for the particular problem at hand. CFS can choose genes which are highly correlated with cancers yet uncorrelated with each other.
When the methods agree and select the same
Acknowledgement
We would like to thank Dr. Marco Zaffalon for proofreading the manuscript and validating our results with his algorithm, Dr. Franceso Bertoni for his advice on lymphoma data analysis, and Annina Neumann for proofreading the manuscript. We are also grateful for the comments given by reviewers, which have significantly improved this paper.
References (38)
- et al. Purification and characterization of zyxin, an 82,000-dalton component of adherens junctions. J. Biol. Chem. (1991)
- et al. Filter versus wrapper gene selection approaches in DNA microarray domains. Artif. Intell. Med. (2004)
- et al. A practical approach for feature selection
- et al. p130CAS forms a signaling complex with the adapter protein crkl in hematopoietic cells transformed by the bcr/abl oncogene. J. Biol. Chem. (1996)
- et al. Restoration of c/ebpalpha expression in a bcr-abl+ cell line induces terminal granulocytic differentiation. J. Biol. Chem. (2003)
- et al. Role of zyxin in differential cell spreading and proliferation of melanoma cells and melanocytes. J. Invest. Dermatol. (2002)
- et al. Zyxin and paxillin proteins: focal adhesion plaque lim domain proteins go nuclear. Biochim Biophys Acta (2003)
- et al. Identification of a gene expression signature associated with pediatric aml prognosis. Blood (2003)
- et al. Members of the zyxin family of lim proteins interact with members of the p130cas family of signal transducers. J. Biol. Chem. (2002)
- et al. Identification of novel gene expression targets for the ras association domain family 1 (rassf1a) tumor suppressor gene in non-small cell lung cancer and neuroblastoma. Cancer Res. (2003)
- Distinct types of diffuse large b-cell lymphoma identified by gene expression profiling. Nature
- Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci.
- Effective dimension reduction methods for tumor classification using gene expression data. Bioinformatics
- Optimization models for cancer classification: extracting gene interaction information from microarray expression data. Bioinformatics
- Multi-interval discretization of continuous-valued attributes for classification learning
- Data mining in bioinformatics using Weka. Bioinformatics
- Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics
- Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science