Journal of Molecular Biology
Using Orthologous and Paralogous Proteins to Identify Specificity-determining Residues in Bacterial Transcription Factors
Introduction
The concepts of orthology and paralogy were originally introduced by Walter Fitch in 1970 [1, 2] and have recently become a subject of active discussion [3, 4, 5, 6]. Briefly, orthologs are genes in different organisms that are direct evolutionary counterparts of each other. Orthologs are inherited through speciation, as opposed to paralogs, which are genes in the same organism that evolved by gene duplication [6, 3, 2]. After duplication, paralogous proteins experience weaker evolutionary pressure and their specificity diverges, leading to the emergence of new specificities and functions. Orthologous proteins, in contrast, are believed to be under similar regulation and to have the same function and usually the same specificity in closely related organisms [7, 8, 9]. In other words, both paralogs and orthologs are assumed to have similar general biochemical functions, while orthologs are also believed to have the same specificity. Although the validity of these assumptions is yet to be verified experimentally, numerous case studies support such views [6, 10]. Several methods have been developed to find orthologous proteins in complete genomes [8, 11]. The assumption of similar regulation of orthologous proteins has been used productively by several groups to identify common regulatory motifs upstream of orthologous genes [12, 9, 13, 14, 15]. In this study we exploit another property of orthologs: similar specificity, as contrasted with the different specificities of paralogs.
If the above assumption is correct, grouping by orthology becomes grouping of proteins by specificity. Here we develop a method that uses such grouping to identify amino acid residues that determine protein specificity. Specificity-determining residues can be very hard to find even when the structure of a protein or a complex is available, since very few amino acid residues provide specific recognition (see below). Extensive site-directed mutagenesis is used to find such residues, though it is frequently complicated by the need to discriminate between specific and non-specific effects of a mutation. Computational prediction of specificity determinants can substantially reduce experimental effort and provide guidance for rational re-design of protein function [16, 17].
Our method relies on the above assumption that binding specificity is conserved among orthologous proteins and differs between paralogous proteins. The idea of our method is (1) to start from a family of paralogs in one genome and find orthologs for each member of the family in other genomes, and (2) to identify residues that best discriminate between these orthologous (specificity) groups.
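The two steps above can be illustrated with a minimal sketch. It assumes that the orthology groups have already been assigned and that the sequences are aligned; the toy sequences, the group labels and the function name column_mutual_information are hypothetical and are not taken from the paper. Each alignment column is scored by the mutual information between the residue found in that column and the group label, which is one natural way to quantify how well a position discriminates between groups.

```python
from collections import Counter
from math import log


def column_mutual_information(column, groups):
    """Mutual information (in nats) between the residues of one
    alignment column and the specificity-group labels."""
    n = len(column)
    joint = Counter(zip(column, groups))   # counts of (residue, group) pairs
    res_counts = Counter(column)           # counts of each residue
    grp_counts = Counter(groups)           # counts of each group
    mi = 0.0
    for (res, grp), c in joint.items():
        # p(res, grp) * log( p(res, grp) / (p(res) * p(grp)) )
        mi += (c / n) * log(c * n / (res_counts[res] * grp_counts[grp]))
    return mi


# Hypothetical toy alignment: three orthologous groups, two sequences each.
groups = ["A", "A", "B", "B", "C", "C"]
alignment = ["ARNG", "ARDG", "AKNG", "AKEG", "ASNG", "ASQG"]

# Score every column of the alignment.
columns = list(zip(*alignment))
scores = [column_mutual_information(col, groups) for col in columns]

for index, score in enumerate(scores):
    print(f"column {index}: I = {score:.3f}")
```

In this toy example the second column (R, K or S depending on the group) separates the groups perfectly and receives the highest score, while the fully conserved columns score zero. A high score alone is not sufficient evidence, since closely related sequences share residues regardless of specificity.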
The assumption that specificity is conserved among orthologs is not necessarily true [18]. However, mislabeling of orthologs and paralogs and errors in specificity assignments may mask some specificity-determining positions, but will not lead to spurious predictions of specificity determinants. Thus any noise decreases the sensitivity of the method, but does not lead to false positives. In the case considered here, the orthology relationships were simple to resolve. Given the increasing amount of genomic data and the emergence of genome analysis techniques such as positional clustering and regulon identification, the possibility of analyzing more complicated cases will increase steadily.
In its second part the method is similar to techniques of hierarchical analysis of residue conservation [19], PCA in the sequence space [20], evolutionary trace analysis [21, 22] and prediction of functional sub-types [23]. All these techniques use a multiple sequence alignment (MSA) to group proteins into sub-groups based on sequence similarity and then identify residues that confer the unique features of each sub-group. Lapidot et al. [24] compared the variability of positions in aligned olfactory receptors of human and mouse, and identified positions conserved in orthologs, but varying in paralogs. A complementary structure-based approach was developed by Johnson & Church to predict protein function using prior knowledge of the binding-site residues [25]. In contrast to other methods, our method relies on the definition of sub-families based on gene orthology and on a rigorous statistical procedure to predict specificity-determining residues. Our statistical procedure determines whether positions in the MSA can discriminate between functional sub-families better than expected from the sequence similarity. Residues at positions that satisfy this criterion are predicted to be specificity-determining. Importantly, our method does not require knowledge of the protein structure and can tolerate certain substitutions within a sub-family.
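The idea of testing whether a position discriminates between sub-families better than expected can be sketched with a simple permutation test. This is an illustrative stand-in, not the statistical procedure of the paper: the null model below only shuffles residues within a column and therefore ignores the background sequence similarity between orthologs, which the actual method is stated to account for. The function names, the toy columns and the number of shuffles are all assumptions.

```python
import random
from collections import Counter
from math import log


def mutual_information(column, groups):
    """Mutual information (in nats) between residues and group labels."""
    n = len(column)
    joint = Counter(zip(column, groups))
    res = Counter(column)
    grp = Counter(groups)
    return sum(c / n * log(c * n / (res[r] * grp[g]))
               for (r, g), c in joint.items())


def permutation_test(column, groups, n_shuffles=2000, seed=0):
    """Return observed MI, mean MI under shuffling, and an empirical p-value.

    Simplified null model (assumption): residues are shuffled within the
    column, which breaks any association with the group labels.
    """
    rng = random.Random(seed)
    observed = mutual_information(column, groups)
    shuffled = list(column)
    null = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        null.append(mutual_information(shuffled, groups))
    expected = sum(null) / n_shuffles
    # Small tolerance guards against floating-point ties; the add-one
    # correction keeps the empirical p-value strictly positive.
    hits = sum(value >= observed - 1e-12 for value in null)
    p_value = (1 + hits) / (n_shuffles + 1)
    return observed, expected, p_value


# Hypothetical columns: three specificity groups of four sequences each.
groups = ["A"] * 4 + ["B"] * 4 + ["C"] * 4
discriminating = "RRRRKKKKSSSS"   # residue tracks the group exactly
noisy = "RKSRKSRKSRKS"            # residue is nearly independent of the group

obs_disc, exp_disc, p_disc = permutation_test(discriminating, groups)
obs_noisy, exp_noisy, p_noisy = permutation_test(noisy, groups)

print(f"discriminating: I={obs_disc:.3f} expected={exp_disc:.3f} p={p_disc:.4f}")
print(f"noisy:          I={obs_noisy:.3f} expected={exp_noisy:.3f} p={p_noisy:.4f}")
```

Under these assumptions the discriminating column has an observed value far above the shuffled mean and a small empirical p-value, while the noisy column does not. A null model that preserves the similarity between orthologs would be more conservative than this sketch.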
Here we present the results of our analysis applied to the LacI/PurR family of bacterial transcription factors. The main result of this study is that among 12 identified specificity-determining residues, three bind the DNA and eight bind the ligand in the ligand-binding domain. The available experimental information supports the critical role of the identified DNA-binding residues in determining the specificity of DNA recognition. The analysis developed here is not limited to DNA-binding proteins and can be applied to any family of proteins where clear orthology or functional grouping can be established.
Specificity determinants of the LacI family
We have chosen the LacI family for our analysis because (1) it is one of the largest families of bacterial transcription factors, (2) the availability of complete bacterial genomes has allowed us to resolve orthology by positional analysis (see Materials and Methods), and (3) available experimental [26, 27, 28] and structural [29, 30] information can be used to assess our predictions.
Figure 1 presents the mutual information Ii, the expected mutual information Iexp and the probability P(I)
Discussion
In this study we suggested a method to identify specificity-determining residues in proteins. We applied it to one of the largest families of bacterial transcription factors and obtained a set of putative specificity-determining residues. Mapping of these residues onto a protein structure showed that most of the identified residues belong to two spatial clusters. Residues of one cluster bind the DNA, while residues of the other cluster form a ligand pocket of the protein. This finding is consistent
Materials and Methods
The key idea of this method is to compare paralogous and orthologous proteins from the same family. As a rule, all paralogous and orthologous proteins have the same biochemical function. Paralogous proteins, however, usually have different specificities as they act on different targets, e.g. bind different ligands or different sites on the DNA. Orthologous proteins, in contrast, have the same specificity in different organisms, e.g. bind the same ligand and similar DNA sites in related genomes.
Acknowledgements
We are grateful to Alexander van Oudenaarden for helpful discussions and for initiating experimental work to test our predictions. We also acknowledge useful comments made by Richard Goldstein and Eugene Shakhnovich while discussing this work. L.M. is partially supported by the William F. Milton Fund and the John F. and Virginia B. Taplin Award. M.G. is partially supported by grants from INTAS (99-1476), the Howard Hughes Medical Institute (55000309), and the Ludwig Cancer Research Institute. We are
References (51)
Homology: a personal view on some of the problems. Trends Genet. (2000)
An evolutionary trace method defines binding surfaces common to protein families. J. Mol. Biol. (1996)
Identification of functional surfaces of the zinc binding domains of intracellular receptors. J. Mol. Biol. (1997)
Analysis and prediction of functional sub-types from protein sequence alignments. J. Mol. Biol. (2000)
Mouse–human orthology relationships in an olfactory receptor gene cluster. Genomics (2001)
The roles of residues 5 and 9 of the recognition helix of lac repressor in lac operator binding. J. Mol. Biol. (1991)
The role of lysine 55 in determining the specificity of the purine repressor for its operators through minor groove interactions. J. Mol. Biol. (1999)
The X-ray structure of the PurR–guanine–purF operator complex reveals the contributions of complementary electrostatic surfaces and a water-mediated. J. Biol. Chem. (1997)
Crystallographic analysis of lac repressor bound to natural operator O1. J. Mol. Biol. (2001)
Supersites within superfolds. Binding site similarity in the absence of homology. J. Mol. Biol. (1998)
Distinguishing homologous from analogous proteins. Syst. Zool.
An apology for orthologs—or brave new memes. Genome Biol.
Homologuephobia. Genome Biol.
Orthologs and paralogs—we need to get it right. Genome Biol.
Can sequence determine function? Genome Biol.
Comparative genomics of the archaea (Euryarchaeota): evolution of conserved protein families, the stable core, and the variable shell. Genome Res.
The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucl. Acids Res.
Prediction of transcription regulatory sites in archaea by a comparative genomic approach. Nucl. Acids Res.
Divergent evolution of enzymatic function: mechanistically diverse superfamilies and functionally distinct suprafamilies. Annu. Rev. Biochem.
The use of gene clusters to infer functional coupling. Proc. Natl Acad. Sci. USA
RegulonDB (version 3.2): transcriptional regulation and operon organization in Escherichia coli K-12. Nucl. Acids Res.
Phylogenetic footprinting of transcription factor binding sites in proteobacterial genomes. Nucl. Acids Res.
Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics
A comparative genomics approach to prediction of new members of regulons. Genome Res.
Structure and Mechanism in Protein Science: A Guide to Enzyme Catalysis and Protein Folding