Using support vector machines to distinguish enzymes: Approached by incorporating wavelet transform
Introduction
Enzymes are among the most important biological catalysts in the metabolism of all organisms. Information about enzyme structure can be determined by various experimental methods, such as biochemical analysis of eukaryotic and prokaryotic genomes or microarray chips, but these methods are both expensive and time-consuming (Chou and Elrod, 2003). Moreover, the number of newly found protein sequences has increased explosively in the post-genomic era. For instance, in 1986 Swiss-Prot contained only 3939 protein sequence entries, but by 2008 the number had jumped to 385721 according to version 55.4 of the UniProtKB/Swiss-Prot Release of May 20, 2008, implying that the number of protein sequences increased roughly 98-fold in about two decades. Facing such a “protein sequence explosion”, it is both challenging and indispensable to develop an automated method for fast and reliable annotation of the enzyme attributes of uncharacterized proteins. The knowledge thus obtained can help us utilize these newly found protein sequences for research in a timely manner.
We aim to demonstrate that protein function can be predicted as enzymatic or not. Determining whether a given new sequence is an enzyme or a non-enzyme remains a vital and challenging task for bioinformatics tools. In the past, a number of algorithms have been developed for this problem, including sequence similarity (Baxevanis, 1998; Bork and Koonin, 1998; Schuler, 1998), evolutionary analysis (Eisen, 1998; Benner et al., 2000), hidden Markov models (Fujiwara and Asogawa, 2002), structural consideration (Teichmann et al., 2001; Di Gennaro et al., 2001), protein/gene fusion (Enright et al., 1999; Marcotte et al., 1999), protein interaction (Aravind, 2000; Bock and Gough, 2001), motifs (Hodges and Tsai, 2002), neural networks (Fujiwara and Asogawa, 2002; Jensen et al., 2002a), and family classification by sequence clustering (Enright and Ouzounis, 2000; Enright et al., 2002). In the absence of clear sequence or structural similarities, the criteria for comparing distantly related proteins become increasingly difficult to formulate (Enright and Ouzounis, 2000). Moreover, not all homologous proteins have analogous functions (Benner et al., 2000). The presence of a shared domain within a group of proteins does not necessarily imply that these proteins perform the same function (Henikoff et al., 1997). Many proteins sharing promiscuous domains are known to have very different functions (Marcotte et al., 1999). Therefore, new independent methodologies that distinguish enzymes among protein sequences, and that offer advantages in some aspects, can complement current methods and further strengthen the ability to computationally distinguish enzymes from non-enzymes.
To address the problems mentioned above, many alternative approaches have been proposed in the past few years. The support vector machine (SVM) based on amino acid sequence is one of them (Dobson and Doig, 2003, 2005; Chou, 2005; Zhou et al., 2007). SVM is well founded theoretically on statistical learning theory and has many attractive features, including effective avoidance of over-fitting, the ability to handle large feature spaces, and the absence of local minima. However, as a machine learning technique, SVM requires input patterns of fixed length, so it cannot be applied directly to proteins whose sequences are too short or too long (Sudipto and Raghava, 2006). Introduced in the early 1980s, wavelets have become a popular signal analysis tool because of their ability to elucidate both spectral and temporal information within a signal simultaneously. The wavelet transform (WT) is a local time-frequency analysis method with an adjustable time window and frequency window. Because of its multi-resolution character, WT has recently been applied in bioinformatics to analyze and process biological data (Lio, 2003; Mandell et al., 1997a; Mandell et al., 1998; Selz et al., 1998; Giuliani et al., 2000; Selz et al., 2004; Selz et al., 2007; Mandell et al., 1997b; Qiu et al., 2004, Qiu et al., 2003; Lu et al., 2004).
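The fixed-length requirement can be illustrated with a minimal sketch. It assumes, purely for illustration, that each variable-length numeric series (for example, one band of wavelet coefficients) is summarized by four statistics; the function name `summarize_bands` and the choice of statistics are hypothetical and are not taken from the paper.

```python
import statistics


def summarize_bands(bands):
    """Map a list of variable-length numeric series to a fixed-length vector.

    Each band contributes four numbers (mean, population standard
    deviation, minimum, maximum), so the output length is 4 * len(bands)
    regardless of how long the individual bands are.
    """
    features = []
    for band in bands:
        if not band:
            raise ValueError("each band must contain at least one value")
        features.extend([
            statistics.fmean(band),
            statistics.pstdev(band),
            min(band),
            max(band),
        ])
    return features


# Two hypothetical proteins of different lengths, each split into two bands.
short_protein = [[0.5, -1.0], [2.0]]
long_protein = [[0.1, 0.4, -0.3, 0.9, -1.2, 0.0], [1.5, -0.5, 0.25]]
print(len(summarize_bands(short_protein)), len(summarize_bands(long_protein)))
```

Both calls return vectors of the same length, which is the property a fixed-input classifier such as an SVM needs.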
This paper combines the discrete wavelet transform (DWT), based on the physicochemical properties of residues, with SVM to develop a new predictor for distinguishing enzyme structures from non-enzymes. The highest accuracy was obtained with the biorthogonal 2.4 (Bior2.4) wavelet at decomposition scale j=5 using the Kyte–Doolittle hydrophobicity scale (KDHΦ). The accuracy of the combined method was higher than that of WT or SVM alone.
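Before any wavelet analysis, a sequence has to be turned into a numeric signal. The following is a minimal sketch of that step using the Kyte–Doolittle hydropathy index (Kyte and Doolittle, 1982); the function name `hydrophobicity_profile` and the handling of unsupported residues are assumptions made for illustration.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KD_HYDROPHOBICITY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobicity_profile(sequence):
    """Convert a one-letter amino acid sequence into a list of KD values."""
    profile = []
    for residue in sequence.upper():
        if residue not in KD_HYDROPHOBICITY:
            # Assumption: reject non-standard residues rather than guess a value.
            raise ValueError(f"unsupported residue: {residue!r}")
        profile.append(KD_HYDROPHOBICITY[residue])
    return profile


print(hydrophobicity_profile("MKVLA"))
```

The resulting list has one value per residue, so its length still varies from protein to protein.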
Section snippets
Datasets
To investigate the feasibility of our method, two benchmark protein datasets were used. The training dataset was originally constructed by Dobson and Doig (2003). It contains 1178 high-resolution proteins in a structurally non-redundant subset of the Protein Data Bank, of which 691 are enzymes and 487 are non-enzymes on the basis of EC number, annotations in the PDB and Medline abstracts. In addition, the other dataset, generated by Cai et al. (2005), consists of the accession
Discrete wavelet transform
In this work, the discrete wavelet transform is the preferred wavelet representation. The WT of a series of hydrophobic values, $f(t)$, is defined as

$$T(a,b) = \frac{1}{\sqrt{a}} \int_{0}^{x} f(t)\, \psi\!\left(\frac{t-b}{a}\right) \mathrm{d}t$$

where a is a scale variable and b is a translation variable; they belong to the real numbers, $a, b \in \mathbb{R}$, and $a > 0$. x is the amino acid sequence length of the protein, while $\psi\left(\frac{t-b}{a}\right)$ is the analyzing wavelet function. The transform coefficients $T(a,b)$ are found for both specific locations on the signal, $t = b$, and for specific wavelet periods
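For a signal sampled at residue positions, a wavelet coefficient at scale a and translation b can be approximated by replacing the integral of the signal against the scaled, shifted wavelet with a sum, normalized by the square root of a. The sketch below assumes the Haar mother wavelet for simplicity, not the Bior2.4 wavelet reported in the paper; the function names are hypothetical.

```python
import math


def haar_wavelet(u):
    """Haar mother wavelet: +1 on [0, 0.5), -1 on [0.5, 1), 0 elsewhere."""
    if 0.0 <= u < 0.5:
        return 1.0
    if 0.5 <= u < 1.0:
        return -1.0
    return 0.0


def wavelet_coefficient(signal, a, b):
    """Approximate the coefficient at scale a and translation b by a sum."""
    if a <= 0:
        raise ValueError("scale a must be positive")
    total = sum(
        value * haar_wavelet((t - b) / a)
        for t, value in enumerate(signal)
    )
    return total / math.sqrt(a)


# A step signal: the coefficient is non-zero only where the wavelet
# window straddles the step.
step = [0.0] * 8 + [1.0] * 8
print(wavelet_coefficient(step, a=4, b=6))  # window covers positions 6..9
print(wavelet_coefficient(step, a=4, b=0))  # window covers positions 0..3
```

Varying b moves the analysis window along the sequence, while varying a changes the width of the feature the wavelet responds to.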
Influence of decomposition levels
We illustrate the DWT on the task of extracting information hidden in protein sequences. Fig. 1 shows the KDHΦ (Kyte and Doolittle, 1982) plot of the protein 1A2J (Brookhaven Protein Data Bank accession: 1A2J_enzyme) and the wavelet decomposition process from levels j=1 to 5. The intensity of the signal is indicated on the y axis; the x axis indicates the residue position along the sequence. S denotes the hydrophobicity plot of the protein 1A2J, cd1, cd2, cd3, cd4 and cd5 are five scales for levels
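A multilevel decomposition of this kind can be sketched in a few lines. The example below is an illustration only: it uses the Haar wavelet rather than Bior2.4, pads odd-length signals by repeating the last value (an assumption), and the function names `haar_step` and `haar_decompose` are hypothetical.

```python
import math


def haar_step(signal):
    """One DWT level: return (approximation, detail) at half the resolution."""
    if len(signal) % 2:
        # Assumption: pad odd lengths by repeating the last value.
        signal = list(signal) + [signal[-1]]
    pairs = range(0, len(signal), 2)
    approx = [(signal[i] + signal[i + 1]) / math.sqrt(2) for i in pairs]
    detail = [(signal[i] - signal[i + 1]) / math.sqrt(2) for i in pairs]
    return approx, detail


def haar_decompose(signal, levels=5):
    """Return ([cd1, ..., cdJ], caJ) for J = levels."""
    if len(signal) < 2 ** levels:
        raise ValueError("signal is too short for the requested levels")
    details = []
    approx = list(signal)
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)
    return details, approx


# Synthetic 64-point signal standing in for a hydrophobicity profile.
signal = [float(i % 7) for i in range(64)]
details, approx = haar_decompose(signal, levels=5)
print([len(cd) for cd in details], len(approx))
```

Each level halves the number of coefficients, so cd1 captures the finest fluctuations between neighbouring residues and cd5 the coarsest.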
Discussion
Enzyme prediction is an important and complex problem. In recent years, enzyme identification and classification problems have been investigated in several works (Chou and Elrod, 2003; Chou, 2005; Zhou et al., 2007; Cai et al., 2005; Jensen et al., 2002b; Cai et al., 2004; Cai and Chou, 2005; Chou and Cai, 2004). Most of these methods were based on the amino acid composition, where the sample of a protein is represented by 20 discrete numbers, with each representing the occurrence
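The amino acid composition representation can be sketched as follows. This is a minimal illustration that assumes each of the 20 numbers is the normalized occurrence frequency of one standard amino acid and that non-standard residues are ignored; the function name is hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return 20 occurrence frequencies, one per standard amino acid."""
    # Assumption: residues outside the 20 standard letters are skipped.
    counted = [residue for residue in sequence.upper() if residue in AMINO_ACIDS]
    if not counted:
        raise ValueError("sequence contains no standard amino acids")
    return [counted.count(aa) / len(counted) for aa in AMINO_ACIDS]


composition = amino_acid_composition("AAKL")
print(len(composition), composition[AMINO_ACIDS.index("A")])
```

Because only counts are kept, two sequences that are permutations of each other, such as "AAKL" and "LKAA", receive identical vectors; sequence-order information is lost in this representation.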
Conclusions
SVM is a powerful statistical learning method, and DWT analysis is an effective tool for extracting structural information on enzyme proteins from sequences. In this work, a novel method has been proposed for predicting whether a given new sequence is an enzyme or a non-enzyme by coupling SVM with DWT. Compared with previously published methods, the predictive performance has been significantly enhanced. It is anticipated that the current method can be a complementary tool for
Acknowledgments
This work was supported by grants from the National Natural Science Foundation of China (20605010, 20865003, 20805023), the Jiangxi Province Natural Science Foundation (2007JZH2644), the Jiangxi Province Education Office (GJJ08048), the Opening Foundation of State Key Laboratory of Chem/Biosensing and Chemometrics of Hunan University (2006022) and the Program for Innovative Research Team of Nanchang University.
References (55)
- et al. Functional inferences from reconstructed evolutionary biology involving rectified databases—an evolutionarily grounded approach to functional genomics. Res. Microbiol. (2000)
- et al. Predicting enzyme family classes by hybridizing gene product composition and pseudo-amino acid composition. J. Theor. Biol. (2005)
- et al. Using GO-PseAA predictor to predict enzyme sub-class. Biochem. Biophys. Res. Commun. (2004)
- et al. Enhanced functional annotation of protein sequences via the use of structural descriptors. J. Struct. Biol. (2001)
- et al. Distinguishing enzyme structures from non-enzymes without alignments. J. Mol. Biol. (2003)
- et al. Predicting enzyme class from protein structures without alignments. J. Mol. Biol. (2005)
- et al. Prediction of human protein function from post-translational modifications and localization features. J. Mol. Biol. (2002)
- et al. A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. (1982)
- et al. ECS: an automatic enzyme classifier based on functional domain composition. Comput. Biol. Chem. (2007)
- et al. Wavelet transformation of protein hydrophobicity sequences suggests their memberships in structural families. Physica A (1997)
- Prediction of protein secondary structure based on continuous wavelet transform. Talanta
- Hydrophobic free energy eigenfunctions of pore, channel and transporter proteins contain β-burst patterns. Biophys. J.
- Determination of protein function, evolution and interactions by structural genomics. Curr. Opin. Struct. Biol.
- Using Chou's amphiphilic pseudo-amino acid composition and support vector machine for prediction of enzyme subfamily classes. J. Theor. Biol.
- Guilt by association: contextual information in genome analysis. Genome Res.
- Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics
- Practical aspects of multiple sequence alignment. Methods Biochem. Anal.
- Predicting protein–protein interactions from primary structure. Bioinformatics
- Predicting functions from protein sequences where are the bottlenecks? Nat. Genet.
- A tutorial on support vector machine for pattern recognition. Data Min. Knowl. Disc.
- Enzyme family classification by support vector machines. Proteins Struct. Funct. Genet.
- Using functional domain composition to predict enzyme family classes. J. Proteome Res.
- Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics
- Prediction of enzyme family classes. J. Proteome Res.
- Review: prediction of protein structural classes. Crit. Rev. Biochem. Mol. Biol.
- Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Res.