Using support vector machines to distinguish enzymes: Approached by incorporating wavelet transform

doi:10.1016/j.jtbi.2008.10.026

Journal of Theoretical Biology

Volume 256, Issue 4, 21 February 2009, Pages 625-631

https://doi.org/10.1016/j.jtbi.2008.10.026

Abstract

The enzymatic attributes of newly found protein sequences are usually determined either by biochemical analysis of eukaryotic and prokaryotic genomes or by microarray chips. These experimental methods are both time-consuming and costly. With the explosion of protein sequences registered in the databanks, it is highly desirable to develop an automated method to identify whether a given new sequence is an enzyme or a non-enzyme. In this study, the discrete wavelet transform (DWT) and support vector machine (SVM) have been used to distinguish enzyme structures from non-enzymes. The models have been trained and tested on two datasets of proteins with different wavelet basis functions, decomposition scales and hydrophobicity data types. Maximum accuracy has been obtained using SVM with the Bior2.4 wavelet function, a decomposition scale j=5, and the Kyte–Doolittle hydrophobicity scale. The results obtained by the self-consistency test, jackknife test and independent dataset test are encouraging, indicating that the proposed method can serve as a useful auxiliary technique for distinguishing enzymes from non-enzymes.

Introduction

Enzymes are among the most important biological catalysts in the metabolism of all organisms. Information about enzyme structure can be determined by various experimental methods, such as biochemical analysis of eukaryotic and prokaryotic genomes or microarray chips, but these are both expensive and time-consuming (Chou and Elrod, 2003). Meanwhile, the number of newly found protein sequences has increased explosively in the post-genomic era. For instance, in 1986 Swiss-Prot contained only 3939 protein sequence entries, but by 2008 the number had jumped to 385721 according to version 55.4 of the UniProtKB/Swiss-Prot Release of May 20, 2008, implying that the number of protein sequences has increased about 98-fold in about two decades. Facing such a “protein sequence explosion”, it is both challenging and indispensable to develop an automated method for fast and reliable annotation of the enzyme attributes of uncharacterized proteins. The knowledge thus obtained can help us utilize these newly found protein sequences for research in a timely manner.

We aim to demonstrate that protein function can be predicted as enzymatic or not. Determining whether a given new sequence is an enzyme or a non-enzyme remains a vital and challenging task for bioinformatics tools. In the past, a number of algorithms have been developed for this problem, including sequence similarity (Baxevanis, 1998; Bork and Koonin, 1998; Schuler, 1998), evolutionary analysis (Eisen, 1998; Benner et al., 2000), hidden Markov models (Fujiwara and Asogawa, 2002), structural consideration (Teichmann et al., 2001; Di Gennaro et al., 2001), protein/gene fusion (Enright et al., 1999; Marcotte et al., 1999), protein interaction (Aravind, 2000; Bock and Gough, 2001), motifs (Hodges and Tsai, 2002), neural networks (Fujiwara and Asogawa, 2002; Jensen et al., 2002a), and family classification by sequence clustering (Enright and Ouzounis, 2000; Enright et al., 2002). In the absence of clear sequence or structural similarities, the criteria for comparing distantly related proteins become increasingly difficult to formulate (Enright and Ouzounis, 2000). Moreover, not all homologous proteins have analogous functions (Benner et al., 2000). The presence of a shared domain within a group of proteins does not necessarily imply that these proteins perform the same function (Henikoff et al., 1997). Many proteins sharing promiscuous domains are known to have very different functions (Marcotte et al., 1999). Therefore, new independent methodologies that can distinguish enzymes from other protein sequences, and that offer advantages in some respects, can complement current methods and further strengthen the ability to computationally predict enzyme structures.

To address the problems mentioned above, many approaches have been proposed in the past few years. The support vector machine (SVM) based on amino acid sequence is one of them (Dobson and Doig, 2003, 2005; Chou, 2005; Zhou et al., 2007). SVM is well founded theoretically on statistical learning theory, and has many attractive features, including effective avoidance of over-fitting, the ability to handle large feature spaces and the absence of local minima. However, as a machine learning technique, SVM requires input patterns of a fixed length, so it cannot be used for proteins that are too short or too long (Sudipto and Raghava, 2006). Introduced in the early 1980s, wavelets have become a popular signal analysis tool due to their ability to elucidate both spectral and temporal information within the signal simultaneously. The wavelet transform (WT) is a local time-frequency analysis method in which both the time window and the frequency window are adjustable. Because of its multi-resolution character, WT has recently been applied in bioinformatics to analyze and process biological data (Lio, 2003; Mandell et al., 1997a; Mandell et al., 1998; Selz et al., 1998; Giuliani et al., 2000; Selz et al., 2004; Selz et al., 2007; Mandell et al., 1997b; Qiu et al., 2003, 2004; Lu et al., 2004).

This paper combines the discrete wavelet transform (DWT), based on the physicochemical properties of residues, with SVM to develop a new predictor for distinguishing enzyme structures from non-enzymes. Maximum accuracy has been obtained by the new method with the biorthogonal 2.4 (Bior2.4) wavelet at decomposition scale j=5 for the Kyte–Doolittle hydrophobicity scale (KDHΦ). The accuracy of the combined method was observed to be higher than that of WT or SVM alone.

Datasets

To investigate the feasibility of our method, two benchmark datasets of proteins have been used. The training dataset was originally constructed by Dobson and Doig (2003). It contains 1178 high-resolution proteins in a structurally non-redundant subset of the Protein Data Bank, of which 691 are enzymes and 487 are non-enzymes on the basis of EC number, annotations in the PDB and Medline abstracts. In addition, the other dataset generated by Cai et al. (2005) that consists of the accession

Discrete wavelet transform

In this work the discrete wavelet is the preferred wavelet representation. The WT of a series of hydrophobic values, $H(x)$, is defined as $T(a, b) = \frac{1}{\sqrt{a}} \int_{0}^{x} H(x)\, \psi\left(\frac{x - b}{a}\right) \mathrm{d}x$, where $a$ is a scale variable and $b$ is a translation variable; they belong to the real numbers $r(n)$, and $a > 0$. $x$ is the amino acid sequence length of the protein, while $\psi((x - b)/a)$ is the analyzing wavelet function. The transform coefficients $T(a, b)$ are found for both specific locations on the signal, $x = b$, and for specific wavelet periods
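The definition of $T(a, b)$ can be approximated numerically by replacing the integral with a sum over residue positions. The sketch below does this for a Kyte–Doolittle hydropathy profile. The Mexican hat wavelet, the peptide string and the function names are illustrative assumptions; the paper reports Bior2.4 as its best basis, and this sampled transform is not the DWT implementation used there.

```python
import math

# Kyte-Doolittle hydropathy values (Kyte and Doolittle, 1982).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mexican_hat(t):
    """Mexican hat wavelet; an assumed stand-in for psi, not the paper's Bior2.4."""
    return (1.0 - t * t) * math.exp(-t * t / 2.0)


def wavelet_coefficient(h, a, b, psi=mexican_hat):
    """Riemann-sum approximation of T(a, b), with unit spacing between residues."""
    if a <= 0:
        raise ValueError("scale a must be positive")
    total = sum(h[x] * psi((x - b) / a) for x in range(len(h)))
    return total / math.sqrt(a)


# Hypothetical peptide, used only to have a signal to transform.
sequence = "MKTAYIAKQRQISFVKSHFSRQ"
h = [KD[residue] for residue in sequence]

# One row of coefficients per scale a, one column per translation b.
coeffs = [[wavelet_coefficient(h, a, b) for b in range(len(h))] for a in (1, 2, 4)]
```

Larger values of `a` stretch the wavelet over more residues, so each row of `coeffs` responds to hydrophobicity variation at a different length scale.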

Influence of decomposition levels

We exemplify the DWT on the task of extracting information hidden in protein sequences. Fig. 1 shows the KDHΦ (Kyte and Doolittle, 1982) plot of the protein 1A2J (Brookhaven Protein Databank accession: 1A2J_enzyme) and the wavelet decomposition process from levels j=1 to 5. The y axis indicates the intensity of the signal; the x axis indicates the residue position along the sequence. S denotes the hydrophobicity plot of the protein 1A2J; cd1, cd2, cd3, cd4 and cd5 are five scales for levels
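A five-level decomposition of the kind described here can be sketched in a few lines. The example below uses the Haar wavelet rather than the Bior2.4 basis reported in the paper, because Haar needs no filter tables; the padding rule for odd-length signals, the synthetic input signal and the function names are assumptions made only for illustration.

```python
import math


def haar_step(signal):
    """One Haar analysis step: return (approximation, detail), each half as long."""
    s = list(signal)
    if len(s) % 2:
        # Odd length: repeat the last sample (an assumed padding rule).
        s.append(s[-1])
    root2 = math.sqrt(2.0)
    approx = [(s[i] + s[i + 1]) / root2 for i in range(0, len(s), 2)]
    detail = [(s[i] - s[i + 1]) / root2 for i in range(0, len(s), 2)]
    return approx, detail


def haar_decompose(signal, levels=5):
    """Return (final approximation, [cd1, ..., cd_levels]) for a 1-D signal."""
    approx = list(signal)
    details = []
    for _ in range(levels):
        if len(approx) < 2:
            raise ValueError("signal too short for the requested number of levels")
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details


# Synthetic 64-point signal standing in for a hydrophobicity profile.
profile = [float(i % 7) for i in range(64)]
ca5, (cd1, cd2, cd3, cd4, cd5) = haar_decompose(profile, levels=5)
```

Each level halves the number of coefficients, so cd1 captures the finest residue-to-residue variation and cd5 the coarsest.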

Discussion

Enzyme prediction is an important and complex problem. In recent years, enzyme identification and classification have been investigated in several works (Chou and Elrod, 2003; Chou, 2005; Zhou et al., 2007; Cai et al., 2005; Jensen et al., 2002b; Cai et al., 2004; Cai and Chou, 2005; Chou and Cai, 2004). Most of these methods were based on the amino acid composition, where a protein sample is represented by 20 discrete numbers, with each representing the occurrence
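The 20-number representation mentioned above can be illustrated as follows. This is a minimal sketch that assumes the numbers are normalized occurrence frequencies of the 20 standard amino acids and that any other characters are ignored; the function name and the ordering of residues are hypothetical choices.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 standard residues


def amino_acid_composition(sequence):
    """Return 20 occurrence frequencies, one per standard amino acid."""
    seq = sequence.upper()
    counts = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)  # non-standard characters are not counted
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [count / total for count in counts]


composition = amino_acid_composition("AACD")
```

Because the vector always has 20 entries regardless of protein length, it fits the fixed-length input that an SVM requires, at the cost of discarding all sequence-order information.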

Conclusions

SVM is a powerful statistical learning method, and DWT analysis is an effective tool for extracting structural information on enzyme proteins from sequences. In this work, a novel method has been proposed for predicting whether a given new sequence is an enzyme or a non-enzyme by coupling SVM with DWT. In comparison with previous reports, the predictive performance has been significantly enhanced. It is anticipated that the current method can be a complementary tool for

Acknowledgments

This work was supported by grants from the National Natural Science Foundation of China (20605010, 20865003, 20805023), the Jiangxi Province Natural Science Foundation (2007JZH2644), the Jiangxi Province Education Office (GJJ08048), the Opening Foundation of State Key Laboratory of Chem/Biosensing and Chemometrics of Hunan University (2006022) and the Program for Innovative Research Team of Nanchang University.

References (55)

S.A. Benner et al.
Functional inferences from reconstructed evolutionary biology involving rectified databases—an evolutionarily grounded approach to functional genomics
Res. Microbiol.
(2000)
Y.D. Cai et al.
Predicting enzyme family classes by hybridizing gene product composition and pseudo-amino acid composition
J. Theor. Biol.
(2005)
K.C. Chou et al.
Using GO-PseAA predictor to predict enzyme sub-class
Biochem. Biophys. Res. Commun.
(2004)
J.A. Di Gennaro et al.
Enhanced functional annotation of protein sequences via the use of structural descriptors
J. Struct. Biol.
(2001)
P.D. Dobson et al.
Distinguishing enzyme structures from non-enzymes without alignments
J. Mol. Biol.
(2003)
P.D. Dobson et al.
Predicting enzyme class from protein structures without alignments
J. Mol. Biol.
(2005)
L.J. Jensen et al.
Prediction of human protein function from post-translational modifications and localization features
J. Mol. Biol.
(2002)
J. Kyte et al.
A simple method for displaying the hydropathic character of a protein
J. Mol. Biol.
(1982)
L.Y. Lu et al.
ECS: an automatic enzyme classifier based on functional domain composition
Comput. Biol. Chem.
(2007)
A.J. Mandell et al.
Wavelet transformation of protein hydrophobicity sequences suggests their memberships in structural families
Physica A
(1997)

J.D. Qiu et al.
Prediction of protein secondary structure based on continuous wavelet transform
Talanta
(2003)
K.A. Selz et al.
Hydrophobic free energy eigenfunctions of pore, channel and transporter proteins contain β-burst patterns
Biophysical J.
(1998)
S.A. Teichmann et al.
Determination of protein function, evolution and interactions by structural genomics
Curr. Opin. Struct. Biol.
(2001)
X.Y. Zhou et al.
Using Chou's amphiphilic pseudo-amino acid composition and support vector machine for prediction of enzyme subfamily classes
J. Theor. Biol.
(2007)
L. Aravind
Guilt by association: contextual information in genome analysis
Genome. Res.
(2000)
P. Baldi et al.
Assessing the accuracy of prediction algorithms for classification: an overview
Bioinformatics
(2000)
A.D. Baxevanis
Practical aspects of multiple sequence alignment
Methods. Biochem. Anal.
(1998)
J.R. Bock et al.
Predicting protein–protein interactions from primary structure
Bioinformatics
(2001)
P. Bork et al.
Predicting functions from protein sequences where are the bottlenecks?
Nat. Genet.
(1998)
C.J.C. Burges
A tutorial on support vector machine for pattern recognition
Data. Min. Knowl. Disc.
(1998)
C.Z. Cai et al.
Enzyme family classification by support vector machines
Proteins Struct. Funct. Genet.
(2004)
Y.D. Cai et al.
Using functional domain composition to predict enzyme family classes
J. Proteome. Res.
(2005)
K.C. Chou
Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes
Bioinformatics
(2005)
K.C. Chou et al.
Prediction of enzyme family classes
J. Proteome. Res.
(2003)
K.C. Chou et al.
Review: prediction of protein structural classes
Crit. Rev. Biochem. Mol. Biol.
(1995)
Daubechies, I., 1992. Ten lectures on wavelets. In: CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM,...
J.A. Eisen
Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis
Genome. Res.
(1998)
