Using Orthologous and Paralogous Proteins to Identify Specificity-determining Residues in Bacterial Transcription Factors

https://doi.org/10.1016/S0022-2836(02)00587-9

Abstract

Concepts of orthology and paralogy are becoming increasingly important as whole-genome comparison allows their identification in complete genomes. Functional specificity of proteins is assumed to be conserved among orthologs and different among paralogs. We used this assumption to identify residues that determine specificity of protein–DNA and protein–ligand recognition. Finding such residues is crucial for understanding mechanisms of molecular recognition and for rational protein and drug design. Assuming conservation of specificity among orthologs and different specificity of paralogs, we identify residues that correlate with this grouping by specificity. The method takes advantage of complete genomes to find multiple orthologs and paralogs. The central part of this method is a procedure to compute statistical significance of the predictions. The procedure is based on a simple statistical model of protein evolution. When applied to a large family of bacterial transcription factors, our method identified 12 residues that are presumed to determine the protein–DNA and protein–ligand recognition specificity. Structural analysis of the proteins and available experimental results strongly support our predictions. Our results suggest new experiments aimed at rational re-design of specificity in bacterial transcription factors by a minimal number of mutations.

Introduction

The concepts of orthology and paralogy were originally introduced by Walter Fitch in 1970 [1, 2] and recently became a subject of active discussion [3, 4, 5, 6]. Briefly, orthologs are genes in different organisms that are direct evolutionary counterparts of each other. Orthologs were inherited through speciation, as opposed to paralogs, which are genes in the same organism that evolved by gene duplication [6, 3, 2]. After duplication, paralogous proteins experience weaker evolutionary pressure and their specificity diverges, leading to the emergence of new specificities and functions. Orthologous proteins, on the contrary, are believed to be under similar regulation and to have the same function and usually the same specificity in closely related organisms [7, 8, 9]. In other words, both paralogs and orthologs are assumed to have similar general biochemical functions, while orthologs are also believed to have the same specificity. Although the validity of these assumptions is yet to be verified experimentally, numerous case studies support such views [6, 10]. Several methods have been developed to find orthologous proteins in complete genomes [8, 11]. The assumption of similar regulation of orthologous proteins was productively used by several groups to identify common regulatory motifs upstream of orthologous proteins [12, 9, 13, 14, 15]. In this study we exploit another property of orthologs: similar specificity, as contrasted with the different specificities of paralogs.

If the above assumption is correct, grouping by orthology becomes grouping of proteins by specificity. Here we developed a method that uses such grouping to identify amino acid residues that determine protein specificity. Specificity-determining residues can be very hard to find even when the structure of a protein or a complex is available, since very few amino acid residues provide specific recognition (see below). Extensive site-directed mutagenesis is used to find such residues, though it is frequently complicated by the need to discriminate between specific and non-specific effects of a mutation. Computational prediction of the specificity determinants can substantially reduce experimental efforts and provide guidance for rational re-design of protein function [16, 17].

Our method relies on the above assumption that binding specificity is conserved among orthologous proteins and differs among paralogous proteins. The idea of our method is (1) to start from a family of paralogs in one genome and find orthologs for each member of the family in other genomes, and (2) to identify residues that best discriminate between these orthologous (specificity) groups.
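Step (2) can be illustrated with a minimal sketch that scores each alignment column by the mutual information between residue identity and specificity group, the quantity reported as Ii in Figure 1. This is an illustration under stated assumptions, not the implementation used in this study: the function names, the toy alignment and the group labels are hypothetical, gaps are not treated specially, and no significance test is applied.

```python
from collections import Counter
from math import log


def column_mutual_information(column, groups):
    """Mutual information (in nats) between residue identity and group label.

    `column` holds one residue per sequence; `groups` holds the matching
    specificity-group label for each sequence (both hypothetical inputs).
    """
    n = len(column)
    joint = Counter(zip(column, groups))
    residue_counts = Counter(column)
    group_counts = Counter(groups)
    mi = 0.0
    for (residue, group), count in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), with p estimated as count / n
        mi += (count / n) * log(
            count * n / (residue_counts[residue] * group_counts[group])
        )
    return mi


def score_alignment(sequences, groups):
    """Return one mutual-information score per column of an ungapped alignment."""
    length = len(sequences[0])
    return [
        column_mutual_information([seq[i] for seq in sequences], groups)
        for i in range(length)
    ]


# Toy data (assumed for illustration): two specificity groups,
# each represented by three orthologs from different genomes.
toy_sequences = ["ARKL", "ARKM", "ARKL", "AQEL", "AQEM", "AQEL"]
toy_groups = ["group1"] * 3 + ["group2"] * 3

scores = score_alignment(toy_sequences, toy_groups)
ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(scores)
print("columns ranked by score:", ranked)
```

In the toy alignment, column 0 is fully conserved and column 3 varies independently of the grouping, so both score zero; columns 1 and 2 separate the two groups perfectly and reach the maximum for two equally sized groups, log 2.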

The assumption that specificity is conserved among orthologs is not necessarily true [18]. However, mislabeling of orthologs and paralogs and errors in specificity assignments may mask some specificity-determining positions, but will not lead to spurious predictions of specificity determinants. Thus any noise decreases the sensitivity of the method, but does not lead to the appearance of false positives. In the case considered here, the orthology relationships were simple to resolve. Given the increasing amount of genomics data and the emergence of genome analysis techniques such as positional clustering and regulon identification, the ability to analyze more complicated cases will increase steadily.

In its second part the method is similar to techniques of hierarchical analysis of residue conservation [19], PCA in the sequence space [20], evolutionary trace analysis [21, 22] and prediction of functional sub-types [23]. All these techniques use multiple sequence alignment (MSA) to group proteins into sub-groups based on sequence similarity and then identify residues that confer the unique features of each sub-group. Lapidot et al. [24] compared the variability of positions in aligned olfactory receptors of human and mouse, and identified positions conserved in orthologs, but varying in paralogs. A complementary structure-based approach was developed by Johnson & Church to predict protein function using prior knowledge of the binding-site residues [25]. In contrast to other methods, our method relies on the definition of sub-families based on gene orthology and a rigorous statistical procedure to predict specificity-determining residues. Our statistical procedure determines whether positions in the MSA can discriminate between functional sub-families better than the sequence similarity. Residues that satisfy these criteria are predicted to be specificity-determining. Importantly, our method does not require knowledge of the protein structure and can tolerate certain substitutions within a sub-family.
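To give a feel for what a significance estimate for a single MSA position can look like, the sketch below uses a plain shuffle test. This is a deliberately simplified stand-in and is not the procedure of this study, which is based on a statistical model of protein evolution: shuffling residues within a column ignores the background sequence similarity within sub-families that the actual procedure must account for. All function names, parameters and data here are hypothetical.

```python
import random
from collections import Counter
from math import log


def mutual_information(column, groups):
    """Mutual information (in nats) between residues in a column and group labels."""
    n = len(column)
    joint = Counter(zip(column, groups))
    residue_counts = Counter(column)
    group_counts = Counter(groups)
    return sum(
        (count / n) * log(count * n / (residue_counts[res] * group_counts[grp]))
        for (res, grp), count in joint.items()
    )


def shuffle_p_value(column, groups, n_shuffles=1000, seed=0):
    """Estimate how often a shuffled column scores at least as high as observed.

    ASSUMPTION: the null model is a uniform shuffle of residues within the
    column, which is simpler than an evolution-based null model.
    """
    rng = random.Random(seed)
    observed = mutual_information(column, groups)
    shuffled = list(column)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        # small tolerance guards against floating-point rounding
        if mutual_information(shuffled, groups) >= observed - 1e-12:
            hits += 1
    # add-one correction keeps the estimate strictly positive
    return observed, (hits + 1) / (n_shuffles + 1)


# Toy data (assumed for illustration): two groups of four sequences each.
toy_groups = ["group1"] * 4 + ["group2"] * 4
sharp_column = list("RRRRQQQQ")  # residue tracks the grouping perfectly
flat_column = list("RQRQRQRQ")   # residue is unrelated to the grouping

obs_sharp, p_sharp = shuffle_p_value(sharp_column, toy_groups)
obs_flat, p_flat = shuffle_p_value(flat_column, toy_groups)
print(f"sharp column: I = {obs_sharp:.3f}, p ~ {p_sharp:.3f}")
print(f"flat column:  I = {obs_flat:.3f}, p ~ {p_flat:.3f}")
```

With only eight sequences, even a perfectly discriminating column cannot reach a very small p-value under this null, which illustrates why having many orthologs per specificity group matters.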

Here we present results of our analysis applied to the LacI/PurR family of bacterial transcription factors. The main result of this study is that among 12 identified specificity-determining residues, three bind the DNA and eight bind the ligand in the ligand-binding domain. The available experimental information supports the critical role of the identified DNA-binding residues in determining the specificity of the DNA recognition. The analysis developed here is not limited to DNA-binding proteins and can be applied to any family of proteins where clear orthology or functional grouping can be established.

Specificity determinants of the LacI family

We have chosen the LacI family for our analysis because (1) it is one of the largest families of bacterial transcription factors, (2) the availability of complete bacterial genomes has allowed us to resolve orthology by positional analysis (see Materials and Methods), and (3) available experimental [26, 27, 28] and structural [29, 30] information can be used to assess our predictions.

Figure 1 presents the mutual information Ii, the expected mutual information Iexp and the probability P(I)
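For orientation, the standard definition of mutual information between the residue type x at alignment position i and the specificity group y is given below. The symbols f_i(x, y), f_i(x) and f(y) are notation assumed here for the observed joint and marginal frequencies; the exact definitions of Iexp and P(I) are not given in this excerpt and are not reconstructed.

```latex
I_i = \sum_{x} \sum_{y} f_i(x, y) \, \log \frac{f_i(x, y)}{f_i(x) \, f(y)}
```

Under this definition I_i is zero when residue type and group are independent at position i, and it grows as the residue at that position becomes more predictive of the group.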

Discussion

In this study we suggested a method to identify specificity-determining residues in proteins. We applied it to one of the largest families of bacterial transcription factors and obtained a set of putative specificity-determining residues. Mapping of these residues onto a protein structure showed that most of the identified residues belong to two spatial clusters. Residues of one cluster bind the DNA, while residues of the other cluster form a ligand pocket of the protein. This finding is consistent

Materials and Methods

The key idea of this method is to compare paralogous and orthologous proteins from the same family. As a rule, all paralogous and orthologous proteins have the same biochemical function. Paralogous proteins, however, usually have different specificities as they act on different targets, e.g. they bind different ligands or different sites on the DNA. Orthologous proteins, in contrast, have the same specificity in different organisms, e.g. they bind the same ligand and similar DNA sites in related genomes.

Acknowledgements

We are grateful to Alexander van Oudenaarden for helpful discussions and initiation of experimental work to test our predictions. We also acknowledge useful comments made by Richard Goldstein and Eugene Shakhnovich while discussing this work. L.M. is partially supported by the William F. Milton Fund and the John F. and Virginia B. Taplin Award. M.G. is partially supported by grants from INTAS (99-1476), the Howard Hughes Medical Institute (55000309), and the Ludwig Cancer Research Institute. We are

References (51)

  • W. Fitch. Distinguishing homologous from analogous proteins. Syst. Zool. (1970)
  • E. Koonin. An apology for orthologs - or brave new memes. Genome Biol. (2001)
  • G. Petsko. Homologuephobia. Genome Biol. (2001)
  • R. Jensen. Orthologs and paralogs - we need to get it right. Genome Biol. (2001)
  • J. Gerlt et al. Can sequence determine function? Genome Biol. (2000)
  • K. Makarova et al. Comparative genomics of the archaea (Euryarchaeota): evolution of conserved protein families, the stable core, and the variable shell. Genome Res. (1999)
  • R. Tatusov et al. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucl. Acids Res. (2000)
  • M. Gelfand et al. Prediction of transcription regulatory sites in archaea by a comparative genomic approach. Nucl. Acids Res. (2000)
  • J. Gerlt et al. Divergent evolution of enzymatic function: mechanistically diverse superfamilies and functionally distinct suprafamilies. Annu. Rev. Biochem. (2001)
  • R. Overbeek et al. The use of gene clusters to infer functional coupling. Proc. Natl Acad. Sci. USA (1999)
  • H. Salgado et al. RegulonDB (version 3.2): transcriptional regulation and operon organization in Escherichia coli K-12. Nucl. Acids Res. (2001)
  • L. McCue et al. Phylogenetic footprinting of transcription factor binding sites in proteobacterial genomes. Nucl. Acids Res. (2001)
  • G. Hertz et al. Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics (1999)
  • K. Tan et al. A comparative genomics approach to prediction of new members of regulons. Genome Res. (2001)
  • A. Fersht. Structure and Mechanism in Protein Science: A Guide to Enzyme Catalysis and Protein Folding (1999)