Elsevier

Journal of Theoretical Biology

Volume 256, Issue 4, 21 February 2009, Pages 625-631
Using support vector machines to distinguish enzymes: Approached by incorporating wavelet transform

https://doi.org/10.1016/j.jtbi.2008.10.026

Abstract

The enzymatic attributes of newly found protein sequences are usually determined either by biochemical analysis of eukaryotic and prokaryotic genomes or by microarray chips. These experimental methods are both time-consuming and costly. With the explosion of protein sequences registered in the databanks, it is highly desirable to develop an automated method to identify whether a given new sequence is an enzyme or a non-enzyme. In this study, the discrete wavelet transform (DWT) and the support vector machine (SVM) are used to distinguish enzyme structures from non-enzymes. The models were trained and tested on two datasets of proteins with different wavelet basis functions, decomposition scales and hydrophobicity data types. Maximum accuracy was obtained using SVM with the Bior2.4 wavelet function, a decomposition scale j=5, and the Kyte–Doolittle hydrophobicity scale. The results obtained by the self-consistency test, jackknife test and independent dataset test are encouraging, indicating that the proposed method can serve as a useful auxiliary technique for distinguishing enzymes from non-enzymes.

Introduction

Enzymes are among the most important biological catalysts in the metabolism of all organisms. Information about enzyme structure can be determined by various experimental methods, such as biochemical analysis of eukaryotic and prokaryotic genomes or microarray chips, but these are both expensive and time-consuming (Chou and Elrod, 2003). Moreover, the number of newly found protein sequences has increased explosively in the post-genomic era. For instance, in 1986 Swiss-Prot contained only 3939 protein sequence entries, but by 2008 the number had jumped to 385721 according to version 55.4 of the UniProtKB/Swiss-Prot Release of May 20, 2008, implying that the number of protein sequences has increased about 98-fold in about two decades. Facing such a “protein sequence explosion”, it is both challenging and indispensable to develop an automated method for fast and reliable annotation of the enzyme attributes of uncharacterized proteins. The knowledge thus obtained can help us make timely use of these newly found protein sequences in research.

We aim to demonstrate that protein function can be predicted as enzymatic or not. Determining whether a given new sequence is an enzyme or a non-enzyme remains a vital and challenging task for bioinformatics tools. In the past, a number of approaches have been developed for this problem, including sequence similarity (Baxevanis, 1998; Bork and Koonin, 1998; Schuler, 1998), evolutionary analysis (Eisen, 1998; Benner et al., 2000), hidden Markov models (Fujiwara and Asogawa, 2002), structural considerations (Teichmann et al., 2001; Di Gennaro et al., 2001), protein/gene fusion (Enright et al., 1999; Marcotte et al., 1999), protein interaction (Aravind, 2000; Bock and Gough, 2001), motifs (Hodges and Tsai, 2002), neural networks (Fujiwara and Asogawa, 2002; Jensen et al., 2002a), and family classification by sequence clustering (Enright and Ozounis, 2000; Enright et al., 2002). In the absence of clear sequence or structural similarities, criteria for comparing distantly related proteins become increasingly difficult to formulate (Enright and Ozounis, 2000). Moreover, not all homologous proteins have analogous functions (Benner et al., 2000). The presence of a shared domain within a group of proteins does not necessarily imply that these proteins perform the same function (Henikoff et al., 1997). Many proteins sharing promiscuous domains are known to have very different functions (Marcotte et al., 1999). Therefore, new independent methodologies that can distinguish enzymes among protein sequences, and that offer advantages in some respects, can complement current methods and further strengthen the ability to computationally predict enzyme structures.

To address the problems mentioned above, many improved methods have been proposed in the past few years. The support vector machine (SVM) based on amino acid sequence is one of them (Dobson and Doig, 2003; Dobson and Doig, 2005; Chou, 2005; Zhou et al., 2007). SVM is well founded in statistical learning theory and has many attractive features, including effective avoidance of over-fitting, the ability to handle large feature spaces and the absence of local minima. However, as a machine learning technique, SVM requires input patterns of a fixed length, so it cannot be applied directly to proteins whose sequences are too short or too long (Sudipto and Raghava, 2006). Introduced in the early 1980s, wavelets have become a popular signal analysis tool owing to their ability to reveal spectral and temporal information within a signal simultaneously. The wavelet transform (WT) is a local time-frequency analysis method in which both the time window and the frequency window are adjustable. Because of its multi-resolution character, WT has recently been applied in bioinformatics to analyze and process biological data (Lio, 2003; Mandell et al., 1997a; Mandell et al., 1998; Selz et al., 1998; Giuliani et al., 2000; Selz et al., 2004; Selz et al., 2007; Mandell et al., 1997b; Qiu et al., 2004; Qiu et al., 2003; Lu et al., 2004).

This paper combines the discrete wavelet transform (DWT), applied to a physicochemical property of the residues, with SVM to develop a new predictor for distinguishing enzyme structures from non-enzymes. Maximum accuracy was obtained by the new method with the biorthogonal 2.4 (Bior2.4) wavelet at decomposition scale j=5 and the Kyte–Doolittle hydrophobicity scale (KDHΦ). The accuracy of the combined method was observed to be higher than that of WT or SVM alone.
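The idea of turning a variable-length sequence into a fixed-length SVM input through a wavelet decomposition of its hydrophobicity profile can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the Haar wavelet is used as a dependency-free stand-in for Bior2.4, and the per-level summary (mean and standard deviation of the coefficients) is a hypothetical choice of features, since the text above does not specify how the coefficients are fed to the SVM. The function names are likewise illustrative. Only the Kyte–Doolittle values are taken from the published scale.

```python
import math

# Kyte-Doolittle hydropathy index (Kyte and Doolittle, 1982).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobicity_profile(sequence):
    """Map a one-letter amino acid sequence to its KD values.

    Non-standard residue codes (e.g. "X") raise KeyError in this sketch.
    """
    return [KD[residue] for residue in sequence.upper()]


def haar_step(signal):
    """One Haar DWT level: return (approximation, detail) coefficients.

    Assumption: Haar stands in for Bior2.4 to keep the example stdlib-only.
    """
    if len(signal) % 2:
        # Pad odd-length input by repeating the last sample.
        signal = signal + [signal[-1]]
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def mean_std(values):
    """Mean and population standard deviation of a non-empty list."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / len(values)
    return m, math.sqrt(var)


def dwt_features(sequence, levels=5):
    """Hypothetical fixed-length feature vector of 2 * (levels + 1) numbers.

    Each decomposition level contributes the mean and standard deviation of
    its detail coefficients; the final approximation contributes one more pair.
    """
    approx = hydrophobicity_profile(sequence)
    features = []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        features.extend(mean_std(detail))
    features.extend(mean_std(approx))
    return features


# Two made-up sequences of different lengths yield vectors of equal length.
short = dwt_features("MKTAYIAKQR")
long_ = dwt_features("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(len(short), len(long_))
```

With the PyWavelets package installed, the Haar loop could be replaced by `pywt.wavedec(profile, "bior2.4", level=5)`, which returns the level-5 approximation followed by the detail coefficients for levels 5 down to 1. The resulting fixed-length vectors are what an SVM implementation would then take as input.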

Datasets

To investigate the feasibility of our method, two benchmark protein datasets were used. The training dataset was originally constructed by Dobson and Doig (2003). It contains 1178 high-resolution proteins in a structurally non-redundant subset of the Protein Data Bank, of which 691 are enzymes and 487 are non-enzymes on the basis of EC number, annotations in the PDB and Medline abstracts. In addition, the other dataset generated by Cai et al. (2005) that consists of the accession

Discrete wavelet transform

In this work the discrete wavelet is the preferred wavelet representation. The WT of a series of hydrophobic values, H(x), is defined as

$$T(a,b)=\frac{1}{\sqrt{a}}\int_{0}^{x}H(x)\,\psi\!\left(\frac{x-b}{a}\right)dx$$

where a is a scale variable and b is a translation variable; they belong to the real numbers r(n), and a>0. x is the amino acid sequence length of the protein, while ψ((x-b)/a) is the analyzing wavelet function. The transform coefficients T(a,b) are found for both specific locations on the signal, x=b, and for specific wavelet periods

Influence of decomposition levels

We illustrate the DWT on the task of extracting information hidden in protein sequences. Fig. 1 shows the KDHΦ (Kyte and Doolittle, 1982) plot of the protein 1A2J (Brookhaven Protein Databank accession: 1A2J_enzyme) and the wavelet decomposition process from levels j=1 to 5. The intensity of the signal is indicated on the y axis; the x axis indicates the residue position along the sequence. S denotes the hydrophobicity plot of the protein 1A2J, cd1, cd2, cd3, cd4 and cd5 are five scales for levels

Discussion

Enzyme prediction is an important and complex problem. In recent years, enzyme identification and classification problems have been investigated in several works (Chou and Elrod, 2003; Chou, 2005; Zhou et al., 2007; Cai et al., 2005; Jensen et al., 2002b; Cai et al., 2004; Cai and Chou, 2005; Chou and Cai, 2004). Most of these methods were based on the amino acid composition, where the sample of a protein is represented by 20 discrete numbers, with each representing the occurrence

Conclusions

SVM is a powerful statistical learning method, and DWT analysis is an effective tool for extracting structural information on enzyme proteins from sequences. In this work, a novel method has been proposed for predicting whether a given new sequence is an enzyme or a non-enzyme by coupling SVM with DWT. Compared with previous reports, the predictive performance has been significantly enhanced. It is anticipated that the current method can be a complementary tool for

Acknowledgments

This work was supported by grants from the National Natural Science Foundation of China (20605010, 20865003, 20805023), the Jiangxi Province Natural Science Foundation (2007JZH2644), the Jiangxi Province Education Office (GJJ08048), the Opening Foundation of State Key Laboratory of Chem/Biosensing and Chemometrics of Hunan University (2006022) and the Program for Innovative Research Team of Nanchang University.

References (55)

  • J.D. Qiu et al., Prediction of protein secondary structure based on continuous wavelet transform, Talanta (2003)
  • K.A. Selz et al., Hydrophobic free energy eigenfunctions of pore, channel and transporter proteins contain β-burst patterns, Biophysical J. (1998)
  • S.A. Teichmann et al., Determination of protein function, evolution and interactions by structural genomics, Curr. Opin. Struct. Biol. (2001)
  • X.Y. Zhou et al., Using Chou's amphiphilic pseudo-amino acid composition and support vector machine for prediction of enzyme subfamily classes, J. Theor. Biol. (2007)
  • L. Aravind, Guilt by association: contextual information in genome analysis, Genome. Res. (2000)
  • P. Baldi et al., Assessing the accuracy of prediction algorithms for classification: an overview, Bioinformatics (2000)
  • A.D. Baxevanis, Practical aspects of multiple sequence alignment, Methods. Biochem. Anal. (1998)
  • J.R. Bock et al., Predicting protein–protein interactions from primary structure, Bioinformatics (2001)
  • P. Bork et al., Predicting functions from protein sequences where are the bottlenecks?, Nat. Genet. (1998)
  • C.J.C. Burges, A tutorial on support vector machine for pattern recognition, Data. Min. Knowl. Disc. (1998)
  • C.Z. Cai et al., Enzyme family classification by support vector machines, Proteins Struct. Funct. Genet. (2004)
  • Y.D. Cai et al., Using functional domain composition to predict enzyme family classes, J. Proteome. Res. (2005)
  • K.C. Chou, Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes, Bioinformatics (2005)
  • K.C. Chou et al., Prediction of enzyme family classes, J. Proteome. Res. (2003)
  • K.C. Chou et al., Review: prediction of protein structural classes, Crit. Rev. Biochem. Mol. Biol. (1995)
  • I. Daubechies, Ten lectures on wavelets, in: CBMS-NSF Regional Conference Series in Applied Mathematics, SIAM,... (1992)
  • J.A. Eisen, Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis, Genome. Res. (1998)