Analysis and prediction of functional sub-types from protein sequence alignments

doi:10.1006/jmbi.2000.4036

Journal of Molecular Biology

Volume 303, Issue 1, 13 October 2000, Pages 61-76

https://doi.org/10.1006/jmbi.2000.4036

Abstract

The increasing number and diversity of protein sequence families require new methods to define and predict details regarding function. Here, we present a method for analysis and prediction of functional sub-types from multiple protein sequence alignments. Given an alignment and set of proteins grouped into sub-types according to some definition of function, such as enzymatic specificity, the method identifies positions that are indicative of functional differences by comparison of sub-type specific sequence profiles, and analysis of positional entropy in the alignment. Alignment positions with significantly high positional relative entropy correlate with those known to be involved in defining sub-types for nucleotidyl cyclases, protein kinases, lactate/malate dehydrogenases and trypsin-like serine proteases. We highlight new positions for these proteins that suggest additional experiments to elucidate the basis of specificity. The method is also able to predict sub-type for unclassified sequences. We assess several variations on a prediction method, and compare them to simple sequence comparisons. For assessment, we remove close homologues of the sequence for which a prediction is to be made (those with a sequence identity above a threshold). This simulates situations where a protein is known to belong to a protein family, but is not a close relative of another protein of known sub-type. Considering the four families above, and a sequence identity threshold of 30 %, our best method gives an accuracy of 96 % compared to 80 % obtained for sequence similarity and 74 % for BLAST. We describe the derivation of a set of sub-type groupings from an automated parsing of alignments from PFAM and the SWISSPROT database, and use this to perform a large-scale assessment. The best method gives an average accuracy of 94 % compared to 68 % for sequence similarity and 79 % for BLAST. We discuss implications for experimental design, genome annotation and the prediction of protein function and protein intra-residue distances.

Introduction

Multiple sequence alignments are central to protein classification and analysis. When protein sequences are aligned, it becomes possible to see sequence conservation patterns that are indicative of, for example, enzyme active sites and secondary structure types (e.g. Zvelebil et al., 1987; Casari et al., 1995; Lichtarge et al., 1996a). With such patterns, it is possible to derive motifs that encapsulate the features defining the protein family. Moreover, the aligned sequences can be used to construct sensitive profiles (e.g. Gribskov et al., 1990; Birney et al., 1996), or hidden Markov models (HMMs; e.g. Eddy, 1998; Krogh et al., 1994) that can be used to detect further, more remote members of a protein family. These techniques and others have greatly aided the detection of protein families and the associated construction of protein alignment databases, such as SMART (Schultz et al., 1998) and PFAM (Bateman et al., 1999), which are of growing importance in the analysis of data from large scale genome sequence projects.

However, the detection and alignment of sequences from diverse protein families creates new problems. Among these is the fact that homologous proteins frequently evolve different functions, which we hereafter refer to as sub-types. These differences are often slight, such as different substrate specificities or activities. In extreme cases, both enzymes and effector molecules (i.e. non-enzymes) can reside in the same homologous superfamily (e.g. Murzin, 1993), and ultimately proteins with similar folds can perform completely different functions (e.g. Russell et al., 1998). If a protein is of unknown function, but is found to belong to a diverse protein superfamily, or fold, with multiple functions, then determining functional sub-type becomes of great importance.

Often a perfect division of a protein family into sub-types can be accomplished by a simple phylogenetic analysis. In other words, sub-type correlates exactly and clearly with the branches of a phylogenetic tree, making the prediction of sub-type simply a matter of deciding into which branch a protein belongs. It is not surprising that most previous attempts to classify proteins have relied heavily on phylogenetic trees.

However, the division of proteins into functional sub-types cannot always be accomplished by phylogeny. If much time has passed since the evolution of different sub-types, then sequences may have diverged beyond the point where phylogeny can easily give a clear division. In addition, proteins usually have multiple features that co-evolve, such as differing affinities for more than one substrate, variations in sub-cellular location (e.g. membrane attached versus cytosolic) or the interaction with other proteins that differ across paralogues, even if other details, such as catalytic mechanism, remain unchanged. Finally, there remains the possibility that details of molecular function may evolve convergently (e.g. Makarova & Grishin, 1999). This is particularly likely in instances where specificity is conferred by only a handful of residues, or even a single position (e.g. Wu et al., 1999).

Various methods have been developed previously that attempt to address the problem of the analysis and prediction of protein sub-types from protein sequence alignments. Livingstone & Barton (1993) developed a method to annotate protein sequence alignments with the aim of highlighting positions of residue conservation. They made use of amino acid properties similar to those of Taylor (1986) and “sensible groups” provided from sequence similarity, functional or evolutionary criteria to highlight positions in the alignment conferring the unique features of a sub-group. The method was demonstrated graphically by analysis of SH2 and annexin domains, but to our knowledge, it has not been applied to the problem of predicting sub-types.

Casari et al. (1995) used a principal component analysis of a vector representation of sequences in space to develop an elegant method for identifying functional residues on proteins based on a multiple sequence alignment. Analysis of various dimensions in the vector sequence space gave both positions that are conserved across the whole protein family and residues specific to sub-types, either specified in advance, or determined from analysis of the sequence space itself. The method was successful at identifying positions determining specificity in the Ras-Rab-Rho superfamily, SH2 domains and cyclins. Subsequent studies have applied this method to alcohol dehydrogenases (Atrian et al., 1998), the ran-RCC1 interaction (Azuma et al., 1999), effector recognition by GTP-binding proteins (Bauer et al., 1999), and other families.

Lichtarge et al. (1996a) developed the “Evolutionary Trace” method to determine positions on protein sequences and structures that are of functional importance. Their method combined knowledge of protein structures with an evolutionary history derived from a phylogenetic tree to extract functionally important residues and to identify functional interfaces on protein surfaces. They made a distinction between positions conserved across all sequences, and those that vary only between subgroups (class-specific). In this way they were able to identify positions on protein structures that were important, both for features of the family as a whole, as well as for particular sub-types. The method has been applied to several protein families, including SH2, SH3, nuclear hormone receptors (Lichtarge et al., 1996a), G-proteins/G-protein coupled receptors (Lichtarge et al., 1996b), zinc binding domains (Lichtarge et al., 1997) and the RGS/G-protein interaction (Sowa et al., 2000).

Sjolander (1998) developed a method of Phylogenetic Inference specifically designed for protein super-family analysis. Here, a phylogenetic tree is built for the input sequences based on nearest neighbour heuristics. The nodes in the tree are represented by a sequence profile of the sequences under that node, and the distance between two nodes is computed in terms of symmetric relative entropy, together with Dirichlet mixture priors. The method ensures that the highly conserved sites have higher weights while computing distances between nodes. The method was applied to SH2-domain containing proteins, resulting in new sub-family assignments for two proteins.
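For orientation, the generic textbook form of a symmetric relative entropy between two amino acid distributions p and q at one profile position is given below. This is only the standard definition; the exact weighting and the Dirichlet mixture priors used by Sjolander (1998) are not reproduced here, and the symbols p, q and a (an index over the 20 amino acids) are notation assumed for this illustration.

```latex
D_{\mathrm{sym}}(p, q) \;=\; \sum_{a} p_a \log\frac{p_a}{q_a} \;+\; \sum_{a} q_a \log\frac{q_a}{p_a}
```

The quantity is zero when the two distributions are identical and grows as they diverge, which is why it can serve as a distance between profile nodes.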

Here, we present another approach for studying protein sub-types associated with sequence alignments. Rather than attempt to define sub-types, we focused on the problems of identifying regions that confer specificity of sub-types already known (e.g. from experimental studies), and of predicting sub-types for “orphan” sequences (i.e. those where no sub-type is known).

Given a multiple sequence alignment and a classification of different sub-types (e.g. differences in enzyme specificity), the method exploits the differences between hidden Markov model profiles to highlight positions on the sequences that are most discerning of each sub-type. The method permits conservative substitutions, and tolerates missing data by combining alignments with amino acid exchange matrices via the construction of an HMM (Eddy, 1998). For new sequences known to be homologous to an existing family, but of unknown sub-type, the method can exploit the known sub-type classifications and associated profiles to predict sub-type. We demonstrate the method first by application to four well characterised protein families. We then perform a large scale assessment of sub-type prediction by applying the method to automatically derived sub-type groupings for 42 alignments from PFAM (Bateman et al., 1999). We discuss implications for experimental design, prediction of protein function, prediction of inter- and intra-protein distances, and applications to genome annotation.
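The prediction step described above can be illustrated with a minimal sketch. It is not the authors' implementation: instead of HMMER profiles it assumes simple per-column frequency profiles with a uniform pseudocount, and it assigns the query to the sub-type whose profile gives the highest summed log-probability. The function names, the pseudocount value and the toy sequences are all hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(seqs, pseudocount=1.0):
    """Per-column amino acid frequencies for equal-length aligned sequences.

    Gap characters are ignored; a uniform pseudocount (an assumption of this
    sketch, standing in for the HMM priors) avoids zero probabilities.
    """
    profile = []
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs if s[i] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS})
    return profile

def log_score(query, profile):
    """Sum of log-probabilities of the query residues under a profile."""
    return sum(
        math.log(col[res])
        for res, col in zip(query, profile)
        if res in AMINO_ACIDS  # skip gaps and unknown residues
    )

def predict_subtype(query, subtypes):
    """Return the name of the sub-type whose profile scores the query highest."""
    profiles = {name: build_profile(seqs) for name, seqs in subtypes.items()}
    return max(profiles, key=lambda name: log_score(query, profiles[name]))

# Toy, invented sub-alignments (three columns each), for illustration only.
toy_subtypes = {
    "ATP": ["AKD", "AKD", "ARD"],
    "GTP": ["AEC", "AEC", "ADC"],
}
print(predict_subtype("AKD", toy_subtypes))
```

In this sketch the query must already be aligned to the family alignment; a real application would obtain that alignment from the family profile first.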

Section snippets

Assessing the discerning value of amino acid positions

This procedure locates positions in a protein alignment that are best able to discriminate between different sub-types. Essentially this involves finding positions that are conserved within each sub-type, but that vary between the different sub-types.
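The idea of a position that is conserved within sub-types but differs between them can be sketched in a few lines. The example below is a simplified stand-in, not the published procedure: it assumes raw column frequencies with a uniform pseudocount rather than HMM profiles, compares only two sub-types, and scores each column by a symmetrised relative entropy without any significance (Z-score) calculation. All names and the toy sequences are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_profile(column, pseudocount=1.0):
    """Amino acid frequencies for one alignment column (gaps ignored)."""
    counts = Counter(c for c in column if c in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}

def relative_entropy(p, q):
    """Relative entropy (Kullback-Leibler divergence) of p with respect to q."""
    return sum(p[a] * math.log(p[a] / q[a]) for a in AMINO_ACIDS)

def position_scores(sub_a, sub_b, pseudocount=1.0):
    """Symmetrised relative entropy between two sub-alignments, per column.

    High values mark columns whose residue distributions differ between
    the two sub-types.
    """
    scores = []
    for i in range(len(sub_a[0])):
        p = column_profile([s[i] for s in sub_a], pseudocount)
        q = column_profile([s[i] for s in sub_b], pseudocount)
        scores.append(relative_entropy(p, q) + relative_entropy(q, p))
    return scores

# Toy, invented sub-alignments: only the middle column separates the groups.
toy_a = ["AKC", "AKC", "AKC"]
toy_b = ["AEC", "AEC", "ADC"]
toy_scores = position_scores(toy_a, toy_b)
best_column = max(range(len(toy_scores)), key=toy_scores.__getitem__)
print(best_column, toy_scores)
```

Columns that are identical in both groups score zero, so ranking columns by this score puts the discriminating middle column first.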

Given an alignment A of sequences in family F, and the sub-types S₁, S₂, …, S_k of the sequences, we extract the sub-alignment A^j from A, corresponding to the sequences of sub-type S_j. We use the hmmbuild program of the HMMER 2.1.1 //hmmer.wustl.edu

Nucleotidyl cyclases

Nucleotidyl cyclases are a family of membrane attached or cytosolic domains that catalyse the reaction that forms a cyclic nucleotide monophosphate from a nucleotide triphosphate. The known cyclases act either on GTP (guanylate cyclase) or ATP (adenylate cyclase). Mutations at two residue positions, Glu to Lys and Cys to Asp, are known to be sufficient to change the specificity of the enzyme from GTP to ATP (Tucker et al., 1998). Mutations of several other residues near to the key Cys-Asp change

Discussion

We have presented and evaluated a method for assigning and analysing sub-types within protein sequence alignments. For four examples we have shown that the method is able to detect positions known to confer specificity in close agreement with experiment. Both on these four examples, and the 42 groupings derived from PFAM/ SWISSPROT, the method is shown to predict protein sub-types with remarkable success, even in the absence of closely related sequences of the same sub-type, and predictions are

Acknowledgements

We thank Jim Fickett and David Searls (SB) for encouragement and support. We are indebted greatly to Pankaj Agarwal (SB) for detailed help in the design of the method and the Z-score computation, in addition to other useful comments. Thanks also to Richard Copley (SB/EMBL Heidelberg), Malcolm Duckworth (SB), Mark Hurle (SB), Chris Larminie (SB), Chris Ponting (NCBI/Oxford) and Richard Jackson (University College, London) for helpful discussions. We are grateful to two referees for helpful and

References (58)

S.F. Altschul et al. Basic local alignment search tool. J. Mol. Biol. (1990)
Y. Azuma et al. Model of the ran-RCC1 interaction using biochemical and docking experiments. J. Mol. Biol. (1999)
B. Bauer et al. Effector recognition by the small GTP-binding proteins Ras and Ral. J. Biol. Chem. (1999)
S.E. Brenner. Errors in genome annotation. Trends Genet. (1999)
M.D. Feese et al. Glycerol kinase from Escherichia coli and an Ala65→Thr mutant: the crystal structures reveal conformational changes with implications for allosteric regulation. Structure (1998)
C. Goh et al. Co-evolution of proteins with their interaction partners. J. Mol. Biol. (2000)
M. Gribskov et al. Profile analysis. Methods Enzymol. (1990)
A. Krogh et al. Hidden Markov models in computational biology. Applications to protein modeling. J. Mol. Biol. (1994)
O. Lichtarge et al. An evolutionary trace method defines binding surfaces common to protein families. J. Mol. Biol. (1996)
O. Lichtarge et al. Identification of functional surfaces of the zinc binding domains of intracellular receptors. J. Mol. Biol. (1997)
K.S. Makarova et al. The Zn-peptidase superfamily: functional convergence after evolutionary divergence. J. Mol. Biol. (1999)
A. Mattevi et al. Refined crystal structure of lipoamide dehydrogenase from Azotobacter vinelandii at 2.2 angstroms resolution. A comparison with the structure of glutathione reductase. J. Mol. Biol. (1991)
A.G. Murzin. Can homologous proteins evolve different enzymatic activities? Trends Biochem. Sci. (1993)
A.G. Murzin et al. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. (1995)
F. Pazos et al. Correlated mutations contain information about protein-protein interaction. J. Mol. Biol. (1997)
M.G. Rossmann et al. Exploring structural homology in proteins. J. Mol. Biol. (1976)
R.B. Russell et al. Structural features can be unconserved in proteins with similar folds: an analysis of side-chain to side-chain contacts, secondary structure and accessibility. J. Mol. Biol. (1994)
R.B. Russell et al. Supersites within superfolds: binding site similarity in the absence of homology. J. Mol. Biol. (1998)
R.A. Sayle et al. RASMOL: Biomolecular graphics for all. Trends Biochem. Sci. (1995)
C.M. Smith et al. The protein kinase resource. Trends Biochem. Sci. (1997)
W.R. Taylor. The classification of amino acid conservation. J. Theoret. Biol. (1986)
M.J.J. Zvelebil et al. Prediction of protein secondary structure and active sites using the alignment of homologous sequences. J. Mol. Biol. (1987)
M.A. Andrade. Position-specific annotation of protein function based on multiple homologs. ISMB (1999)
M.A. Andrade et al. Automatic extraction of keywords from scientific text: application to the knowledge domain of protein families. Bioinformatics (1998)
S. Atrian et al. Shaping of Drosophila alcohol dehydrogenase through evolution: relationship with enzyme functionality. J. Mol. Evol. (1998)
A. Bairoch et al. The SWISSPROT protein sequence data bank and its new supplement TrEMBL in 1999. Nucl. Acids Res. (1999)
G.J. Barton. ALSCRIPT: a tool to format multiple sequence alignments. Protein Eng. (1993)
A. Bateman et al. Pfam 3.1: 1313 multiple alignments and profile HMMs match the majority of proteins. Nucl. Acids Res. (1999)
E. Birney et al. PairWise and SearchWise: finding the optimal alignment in a simultaneous comparison of a protein profile against all DNA translation frames. Nucl. Acids Res. (1996)


¹: Edited by J. Thornton

²: Present address: R. B. Russell; EMBL, Meyerhofstrasse 1, D-69117 Heidelberg, Germany.
