Journal of Molecular Biology
Regular article
Analysis and prediction of functional sub-types from protein sequence alignments
Introduction
Multiple sequence alignments are central to protein classification and analysis. When protein sequences are aligned, it becomes possible to see sequence conservation patterns that are indicative of, for example, enzyme active sites and secondary structure types (e.g. Zvelebil et al., 1987; Casari et al., 1995; Lichtarge et al., 1996a). With such patterns, it is possible to derive motifs that encapsulate the features defining the protein family. Moreover, the aligned sequences can be used to construct sensitive profiles (e.g. Gribskov et al., 1990; Birney et al., 1996), or hidden Markov models (HMMs; e.g. Eddy, 1998; Krogh et al., 1994) that can be used to detect further, more remote members of a protein family. These techniques and others have greatly aided the detection of protein families and the associated construction of protein alignment databases, such as SMART (Schultz et al., 1998) and PFAM (Bateman et al., 1999), which are of growing importance in the analysis of data from large scale genome sequence projects.
However, the detection and alignment of sequences from diverse protein families creates new problems. Among these is the fact that homologous proteins frequently evolve different functions, which we hereafter refer to as sub-types. It is common for proteins to evolve slightly different functions, such as different substrate specificities or activities. In extreme cases, both enzymes and effector molecules (i.e. non-enzymes) can reside in the same homologous superfamily (e.g. Murzin, 1993), and ultimately proteins with similar folds can perform completely different functions (e.g. Russell et al., 1998). If a protein is of unknown function, but is found to belong to a diverse protein superfamily, or fold, with multiple functions, then determining its functional sub-type becomes of great importance.
Often a perfect division of a protein family into sub-types can be accomplished by a simple phylogenetic analysis. In other words, sub-type correlates exactly with the branches of a phylogenetic tree, making the prediction of sub-type simply a matter of deciding into which branch a protein belongs. It is not surprising that most previous attempts to classify proteins have relied heavily on phylogenetic trees.
However, the division of proteins into functional sub-types cannot always be accomplished by phylogeny. If much time has passed since the evolution of different sub-types, then sequences may have diverged beyond the point where phylogeny can easily give a clear division. In addition, proteins usually have multiple features that co-evolve, such as differing affinities for more than one substrate, variations in sub-cellular location (e.g. membrane attached versus cytosolic) or the interaction with other proteins that differ across paralogues, even if other details, such as catalytic mechanism, remain unchanged. Finally, there remains the possibility that details of molecular function may evolve convergently (e.g. Makarova & Grishin, 1999). This is particularly likely in instances where specificity is conferred by only a handful of residues, or even a single position (e.g. Wu et al., 1999).
Various methods have been developed previously that attempt to address the problem of the analysis and prediction of protein sub-types from protein sequence alignments. Livingstone & Barton (1993) developed a method to annotate protein sequence alignments with the aim of highlighting positions of residue conservation. They made use of amino acid properties similar to those of Taylor (1986) and “sensible groups” provided from sequence similarity, functional or evolutionary criteria to highlight positions in the alignment conferring the unique features of a sub-group. The method was demonstrated graphically by analysis of SH2 and annexin domains, but to our knowledge, it has not been applied to the problem of predicting sub-types.
Casari et al. (1995) used a principal component analysis of a vector representation of sequences in space to develop an elegant method for identifying functional residues on proteins based on a multiple sequence alignment. Analysis of various dimensions in the vector sequence space gave both positions that are conserved across the whole protein family and residues specific to sub-types, either specified in advance, or determined from analysis of the sequence space itself. The method was successful at identifying positions determining specificity in the Ras-Rab-Rho superfamily, SH2 domains and cyclins. Subsequent studies have applied this method to alcohol dehydrogenases (Atrian et al., 1998), the ran-RCC1 interaction (Azuma et al., 1999), effector recognition by GTP-binding proteins (Bauer et al., 1999), and other families.
Lichtarge et al. (1996a) developed the “Evolutionary Trace” method to determine positions of functional importance on protein sequences and structures. Their method combined knowledge of protein structures with an evolutionary history derived from a phylogenetic tree to extract functionally important residues and identify functional interfaces on protein surfaces. They made a distinction between positions conserved across all sequences, and those that vary only between subgroups (class-specific). In this way they were able to identify positions on protein structures that were important both for features of the family as a whole and for particular sub-types. The method has been applied to several protein families, including SH2, SH3, nuclear hormone receptors (Lichtarge et al., 1996a), G-proteins/G-protein coupled receptors (Lichtarge et al., 1996b), zinc binding domains (Lichtarge et al., 1997) and the RGS/G-protein interaction (Sowa et al., 2000).
Sjolander (1998) developed a method of Phylogenetic Inference specifically designed for protein super-family analysis. Here, a phylogenetic tree is built for the input sequences based on nearest neighbour heuristics. The nodes in the tree are represented by a sequence profile of the sequences under that node, and the distance between two nodes is computed in terms of symmetric relative entropy, together with Dirichlet mixture priors. The method ensures that the highly conserved sites have higher weights while computing distances between nodes. The method was applied to SH2-domain containing proteins, resulting in new sub-family assignments for two proteins.
Here, we present another approach for studying protein sub-types associated with sequence alignments. Rather than attempt to define sub-types, we focused on the problems of identifying regions that confer specificity of sub-types already known (e.g. from experimental studies), and of predicting sub-types for “orphan” sequences (i.e. those where no sub-type is known).
Given a multiple sequence alignment and a classification of different sub-types (e.g. differences in enzyme specificity), the method exploits the differences between hidden Markov model profiles to highlight positions on the sequences that are most discerning of each sub-type. The method permits conservative substitutions, and tolerates missing data by combining alignments with amino acid exchange matrices via the construction of an HMM (Eddy, 1998). For new sequences known to be homologous to an existing family, but of unknown sub-type, the method can exploit the known sub-type classifications and associated profiles to predict sub-type. We demonstrate the method first by application to four well characterised protein families. We then perform a large scale assessment of sub-type prediction by applying the method to automatically derived sub-type groupings for 42 alignments from PFAM (Bateman et al., 1999). We discuss implications for experimental design, prediction of protein function, prediction of inter- and intra-protein distances, and applications to genome annotation.
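As an illustration of the sub-type prediction step, the following minimal Python sketch scores an orphan sequence against one profile per known sub-type and reports the best-scoring sub-type. It is a simplified stand-in and not the procedure used in this work: it replaces the HMMs with pseudocount-smoothed log-odds column profiles, assumes a uniform background distribution, and assumes the orphan sequence has already been aligned to the family alignment. The names `build_profile`, `score` and `predict_subtype` are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequencies


def build_profile(sub_alignment, pseudocount=1.0):
    """Return one log-odds dictionary per column of a sub-type sub-alignment."""
    profile = []
    for column in zip(*sub_alignment):
        # Gaps and non-standard symbols are ignored when counting.
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log(((counts[aa] + pseudocount) / total) / BACKGROUND)
            for aa in AMINO_ACIDS
        })
    return profile


def score(profile, aligned_seq):
    """Sum the log-odds of the residues in an aligned sequence; gaps score zero."""
    return sum(col[res] for col, res in zip(profile, aligned_seq) if res in col)


def predict_subtype(alignment, subtypes, orphan):
    """Score the aligned orphan against each sub-type profile; return (best, scores)."""
    groups = {}
    for seq, label in zip(alignment, subtypes):
        groups.setdefault(label, []).append(seq)
    scores = {label: score(build_profile(seqs), orphan)
              for label, seqs in groups.items()}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Toy alignment: the second column separates sub-type X (E) from Y (K).
    toy_alignment = ["AEG", "AEL", "AKG", "AKL"]
    toy_subtypes = ["X", "X", "Y", "Y"]
    best, all_scores = predict_subtype(toy_alignment, toy_subtypes, "AKP")
    print(best, all_scores)
```

In this toy case the two sub-type profiles differ only at the second column, so the prediction is driven entirely by that position. A real application would also need a significance estimate for the difference between scores, which this sketch omits.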
Assessing the discerning value of amino acid positions
This procedure locates positions in a protein alignment that are best able to discriminate between different sub-types. Essentially this involves finding positions that are conserved within each sub-type, but that vary between the different sub-types.
Given an alignment A of sequences in family F, and the sub-types S1, S2, …, Sk of the sequences, we extract the sub-alignment Aj from A, corresponding to the sequences of sub-type Sj. We use the hmmbuild program of HMMER 2.1.1 (http://hmmer.wustl.edu) …
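The idea of positions that are conserved within each sub-type but vary between sub-types can be sketched in a few lines of Python. This is an illustrative approximation under stated assumptions, not the procedure of this work: plain pseudocount-smoothed column frequencies stand in for the hmmbuild profiles, and each column is scored by the relative entropy between a sub-type and the remaining sequences, summed over sub-types. The function names and the toy alignment are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_profile(column, pseudocount=1.0):
    """Pseudocount-smoothed amino acid frequencies for one alignment column."""
    # Gaps and non-standard symbols are ignored when counting.
    counts = Counter(res for res in column if res in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def relative_entropy(p, q):
    """Relative entropy (Kullback-Leibler divergence) of p with respect to q."""
    return sum(p[aa] * math.log(p[aa] / q[aa]) for aa in AMINO_ACIDS)


def discriminating_scores(alignment, subtypes):
    """Return one score per column: the sum over sub-types of the relative
    entropy between the sub-type's column profile and that of all other
    sequences. Higher scores mark columns that separate sub-types better."""
    labels = sorted(set(subtypes))
    scores = []
    for i in range(len(alignment[0])):
        total = 0.0
        for label in labels:
            inside = [s[i] for s, lab in zip(alignment, subtypes) if lab == label]
            outside = [s[i] for s, lab in zip(alignment, subtypes) if lab != label]
            total += relative_entropy(column_profile(inside),
                                      column_profile(outside))
        scores.append(total)
    return scores


if __name__ == "__main__":
    # Toy alignment: column 1 is invariant, column 2 is sub-type specific,
    # column 3 varies in the same way inside both sub-types.
    toy_alignment = ["AEG", "AEL", "AEP", "AKG", "AKL", "AKP"]
    toy_subtypes = ["X", "X", "X", "Y", "Y", "Y"]
    for position, value in enumerate(discriminating_scores(toy_alignment,
                                                           toy_subtypes), 1):
        print(position, round(value, 3))
```

In the toy example only the second column receives a non-zero score, because the first column is identical in both sub-types and the third column has the same residue distribution inside and outside each sub-type.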
Nucleotidyl cyclases
Nucleotidyl cyclases are a family of membrane attached or cytosolic domains that catalyse the reaction that forms a cyclic nucleotide monophosphate from a nucleotide triphosphate. The known cyclases act either on GTP (guanylate cyclase) or ATP (adenylate cyclase). Mutations at two residue positions, Glu→Lys and Cys→Asp, are known to be sufficient to change the specificity of the enzyme from GTP to ATP (Tucker et al., 1998). Mutations of several other residues near to the key Cys-Asp change
Discussion
We have presented and evaluated a method for assigning and analysing sub-types within protein sequence alignments. For four examples we have shown that the method is able to detect positions known to confer specificity in close agreement with experiment. Both on these four examples and on the 42 groupings derived from PFAM/SWISSPROT, the method is shown to predict protein sub-types with remarkable success, even in the absence of closely related sequences of the same sub-type, and predictions are
Acknowledgements
We thank Jim Fickett and David Searls (SB) for encouragement and support. We are indebted greatly to Pankaj Agarwal (SB) for detailed help in the design of the method and the Z-score computation, in addition to other useful comments. Thanks also to Richard Copley (SB/EMBL Heidelberg), Malcolm Duckworth (SB), Mark Hurle (SB), Chris Larminie (SB), Chris Ponting (NCBI/Oxford) and Richard Jackson (University College, London) for helpful discussions. We are grateful to two referees for helpful and
References
- et al. Basic local alignment search tool. J. Mol. Biol. (1990)
- et al. Model of the ran-RCC1 interaction using biochemical and docking experiments. J. Mol. Biol. (1999)
- et al. Effector recognition by the small GTP-binding proteins Ras and Ral. J. Biol. Chem. (1999)
- Errors in genome annotation. Trends Genet. (1999)
- et al. Glycerol kinase from Escherichia coli and an Ala65→Thr mutant: the crystal structures reveal conformational changes with implications for allosteric regulation. Structure (1998)
- et al. Co-evolution of proteins with their interaction partners. J. Mol. Biol. (2000)
- et al. Profile analysis. Methods Enzymol. (1990)
- et al. Hidden Markov models in computational biology. Applications to protein modeling. J. Mol. Biol. (1994)
- et al. An evolutionary trace method defines binding surfaces common to protein families. J. Mol. Biol. (1996)
- et al. Identification of functional surfaces of the zinc binding domains of intracellular receptors. J. Mol. Biol. (1997)
- The Zn-peptidase superfamily: functional convergence after evolutionary divergence. J. Mol. Biol.
- Refined crystal structure of lipoamide dehydrogenase from Azotobacter vinelandii at 2.2 angstroms resolution. A comparison with the structure of glutathione reductase. J. Mol. Biol.
- Can homologous proteins evolve different enzymatic activities? Trends Biochem. Sci.
- SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol.
- Correlated mutations contain information about protein-protein interaction. J. Mol. Biol.
- Exploring structural homology in proteins. J. Mol. Biol.
- Structural features can be unconserved in proteins with similar folds: an analysis of side-chain to side-chain contacts, secondary structure and accessibility. J. Mol. Biol.
- Supersites within superfolds: binding site similarity in the absence of homology. J. Mol. Biol.
- RASMOL: biomolecular graphics for all. Trends Biochem. Sci.
- The protein kinase resource. Trends Biochem. Sci.
- The classification of amino acid conservation. J. Theoret. Biol.
- Prediction of protein secondary structure and active sites using the alignment of homologous sequences. J. Mol. Biol.
- Position-specific annotation of protein function based on multiple homologs. ISMB
- Automatic extraction of keywords from scientific text: application to the knowledge domain of protein families. Bioinformatics
- Shaping of Drosophila alcohol dehydrogenase through evolution: relationship with enzyme functionality. J. Mol. Evol.
- The SWISSPROT protein sequence data bank and its new supplement TrEMBL in 1999. Nucl. Acids Res.
- ALSCRIPT: a tool to format multiple sequence alignments. Protein Eng.
- Pfam 3.1: 1313 multiple alignments and profile HMMs match the majority of proteins. Nucl. Acids Res.
- PairWise and SearchWise: finding the optimal alignment in a simultaneous comparison of a protein profile against all DNA translation frames. Nucl. Acids Res.
1 Edited by J. Thornton.
2 Present address: R. B. Russell; EMBL, Meyerhofstrasse 1, D-69117 Heidelberg, Germany.