Journal of Molecular Biology
Volume 287, Issue 5, 16 April 1999, Pages 1023-1040

Regular article
Gleaning non-trivial structural, functional and evolutionary information about proteins by iterative database searches1

https://doi.org/10.1006/jmbi.1999.2653

Abstract

Using a number of diverse protein families as test cases, we investigate the ability of the recently developed iterative sequence database search method, PSI-BLAST, to identify subtle relationships between proteins that were originally deemed detectable only at the level of structure-structure comparison. We show that PSI-BLAST can detect many, though not all, of these relationships, but success depends critically on the choice of the query sequence used to initiate the search. Generally, there is a correlation between the diversity of the sequences detected in the first pass of database screening and the ability of a given query to detect subtle relationships in subsequent iterations. Accordingly, to maximize the chances of gleaning non-trivial structural and functional inferences, a thorough analysis of protein superfamilies at the sequence level is necessary, as opposed to a single search initiated, for example, with the sequence of a protein whose structure is available.
This strategy is illustrated by several findings, each of which involves an unexpected structural prediction: (i) a number of previously undetected proteins with the HSP70-actin fold are identified, including a highly conserved and nearly ubiquitous family of metal-dependent proteases (typified by bacterial O-sialoglycoprotease) that represent an adaptation of this fold to a new type of enzymatic activity; (ii) we show that, contrary to previous conclusions, ATP-dependent and NAD-dependent DNA ligases are confidently predicted to possess the same fold; (iii) the C-terminal domain of 3-phosphoglycerate dehydrogenase, which binds serine and is involved in allosteric regulation of the enzyme activity, is shown to typify a new superfamily of ligand-binding, regulatory domains found primarily in enzymes and regulators of amino acid and purine metabolism; (iv) the immunoglobulin-like DNA-binding domain previously identified in the structures of transcription factors NFκB and NFAT is shown to be a member of a distinct superfamily of intracellular and extracellular domains with the immunoglobulin fold; and (v) the Rag-2 subunit of the V-D-J recombinase is shown to contain a kelch-type β-propeller domain, which rules out its evolutionary relationship with bacterial transposases.

Introduction

Protein structure determination inevitably lags far behind the explosive quantitative and qualitative (thanks to the determination of genome sequences of taxonomically diverse organisms) growth of sequence databases. It has been observed, however, that newly determined structures increasingly tend to fall into already known structural folds (Murzin, 1996; Murzin, 1998). This indicates that the number of folds (the basic types of globular domains) is finite and is unlikely to exceed a few thousand (Chothia, 1992; Orengo et al., 1994). Moreover, while it is difficult to estimate the total number of folds with greater precision, it seems clear that for most of the widespread folds, representative structures are already available. Thus, it is highly probable that for any new protein sequence that does not have a significant compositional bias and, accordingly, is likely to form a globular domain(s) (Wootton, 1994), a structure with the same fold is present in the protein data bank (PDB; Bernstein et al., 1977). In order to obtain structural information about a given protein domain, all one needs is to establish a reliable alignment with the sequence of one of the domains with a known structure. More often than not, however, this task is not trivial. Major transitions in the evolution of life appear to have been accompanied (or in part driven) by the origin of new protein families from preexisting ones, during which sequences diverge rapidly while the structure remains basically conserved (Doolittle, 1995). This erosion of sequence information in the course of evolution is the major obstacle in making structural predictions using homology inferred from sequence similarity. Accordingly, a number of unexpected connections between protein families originally thought to be unrelated have recently been established by comparison of experimentally determined three-dimensional structures (Holm and Sander, 1996; Holm and Sander, 1997; Murzin, 1996; Murzin, 1998; Murzin and Bateman, 1997).

In order to maximize the rate of structural prediction from protein sequences, increasing the sensitivity of sequence comparison methods is critical. The subtle relationships discovered by structure-structure comparison may be considered the gold standard for sequence analysis methods. Those methods that are sufficiently powerful to detect at least some of the connections originally perceived as “structural only” should be expected to routinely produce non-trivial structural predictions. Most of the advanced sequence database search methods utilize information contained in multiple alignments. The recently developed PSI (Position-Specific Iterating)-BLAST method constructs a multiple alignment from the BLAST hits, converts it into a position-specific weight matrix and iterates the search using this matrix as the query (Altschul et al., 1997; Altschul and Koonin, 1998). Several in-depth studies of protein families as well as benchmarking experiments suggest that, given the new level of protein sequence diversity coming from whole genome sequencing, this method may significantly increase our ability to detect subtle sequence similarities and, in particular, to make non-trivial structure predictions (Aravind and Koonin, 1998; Aravind et al., 1998; Huynen et al., 1998; Mushegian et al., 1997; Rychlewski et al., 1998; Wolf et al., 1999).
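The iterate-with-a-profile idea described above can be illustrated with a deliberately simplified sketch. It is not the PSI-BLAST implementation: it assumes gapless, fixed-width alignments, a uniform background distribution, a flat pseudocount and a raw score cutoff in place of E-values, and the names `build_pssm`, `best_window_score` and `iterative_search` are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Build per-column log-odds scores from equal-length, gapless aligned segments.

    Assumptions (not from the paper): uniform background frequencies and a
    flat pseudocount added to every residue in every column.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned segments must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def best_window_score(pssm, sequence):
    """Return (score, start) of the best gapless window, or None if the sequence is too short."""
    width = len(pssm)
    best = None
    for start in range(len(sequence) - width + 1):
        # Non-standard residues contribute a neutral score of 0.
        score = sum(pssm[i].get(sequence[start + i], 0.0) for i in range(width))
        if best is None or score > best[0]:
            best = (score, start)
    return best


def iterative_search(query, database, threshold, max_rounds=5):
    """Toy iteration: rebuild the profile from all accepted hits until no new hits appear."""
    alignment = [query]
    found = set()
    for _ in range(max_rounds):
        pssm = build_pssm(alignment)
        new_hits = []
        for name, seq in database.items():
            if name in found:
                continue
            hit = best_window_score(pssm, seq)
            if hit is not None and hit[0] >= threshold:
                start = hit[1]
                new_hits.append((name, seq[start:start + len(query)]))
        if not new_hits:
            break
        for name, segment in new_hits:
            found.add(name)
            alignment.append(segment)
    return found


if __name__ == "__main__":
    # Hypothetical toy database: "C" is only reachable once "B" has joined the profile.
    database = {"B": "ACDEKL", "C": "AQDEKL", "X": "WWWWWW"}
    print(sorted(iterative_search("ACDEFG", database, threshold=3.0)))
```

In this toy database the sequence labelled `C` scores below the cutoff against the query alone and is accepted only after `B` has been folded into the profile, which is the behaviour that makes an iterated profile search more sensitive than a single pass.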

Here, using several previously described cases of relationships between protein families that have been deemed to be detectable only by structure-structure comparison, we show that with appropriate starting points, PSI-BLAST is capable of detecting, at the sequence level, many of these subtle similarities. We demonstrate that typically, the best starting points for the iterative search are those that produce the greatest diversity of hits in the first BLAST pass. We then investigate several new examples of unexpected structural inferences for highly conserved protein domains that have important functional and evolutionary implications.
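One way to make the "greatest diversity of hits in the first pass" criterion concrete is sketched below. The diversity measure used here (mean pairwise fraction of differing positions among equal-length aligned hit segments) is an assumption chosen for illustration, not a formula given in the text, and the names `hit_diversity` and `rank_queries` are hypothetical.

```python
from itertools import combinations


def fraction_different(a, b):
    """Fraction of aligned positions at which two equal-length segments differ."""
    if len(a) != len(b):
        raise ValueError("segments must have the same length")
    if not a:
        return 0.0
    return sum(x != y for x, y in zip(a, b)) / len(a)


def hit_diversity(hits):
    """Mean pairwise difference among first-pass hit segments (0.0 for fewer than two hits)."""
    pairs = list(combinations(hits, 2))
    if not pairs:
        return 0.0
    return sum(fraction_different(a, b) for a, b in pairs) / len(pairs)


def rank_queries(first_pass_hits):
    """Order candidate queries from most to least diverse first-pass hit set."""
    return sorted(
        first_pass_hits,
        key=lambda query: hit_diversity(first_pass_hits[query]),
        reverse=True,
    )


if __name__ == "__main__":
    # Hypothetical aligned hit segments for two candidate starting queries.
    first_pass_hits = {
        "q1": ["ACDE", "ACDE", "ACDF"],
        "q2": ["ACDE", "AQDK", "WCDF"],
    }
    print(rank_queries(first_pass_hits))
```

Under this assumed measure, a candidate whose first-pass hits are near-identical copies of one another ranks below a candidate whose hits already span divergent family members.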

The strategy for protein superfamily analysis using PSI-BLAST

To assess the ability of PSI-BLAST to detect subtle similarity between proteins, we chose several cases where a relationship was originally discovered by structure-structure comparison and has been deemed undetectable at the sequence level (Table 1). The examples include the classical case of structural similarity between actins, the HSP70 class of molecular chaperones and sugar kinases (Bork et al., 1992), as well as more recently described relationships, such as those between antibiotic

Databases

Standard database searches were performed using the non-redundant (NR) protein database at the NCBI. The structural databases used here were PDB and SCOP (Structural Classification of Proteins; Murzin et al., 1995; Hubbard et al., 1999). SCOP employs a manual process to identify structural relationships between proteins and classifies them into a four-level hierarchy. This hierarchy from top to bottom reflects the protein structural class in terms of secondary structural elements (α-helices and

Acknowledgements

We are grateful to Michael Rozanov for his participation in the early stage of the HSP70 superfamily analysis.

1999 U.S. Government

References (91)

  • M Huynen et al.

    Homology-based fold predictions for Mycoplasma genitalium proteins

    J. Mol. Biol.

    (1998)
  • S Jabbouri et al.

    nolO and noeI (HsnIII) of Rhizobium sp. NGR234 are involved in 3-O-carbamoylation and 2-O-methylation of Nod factors

    J. Biol. Chem.

    (1998)
  • M Komoszynski et al.

    Apyrases (ATP diphosphohydrolases, EC 3.6.1.5): function and relationship to ATPases

    Biochim. Biophys. Acta

    (1996)
  • E.V Koonin

    Yeast protein controlling inter-organelle communication is related to bacterial phosphatases containing the Hsp70-type ATP-binding domain

    Trends Biochem. Sci.

    (1994)
  • A Lupas

    Prediction and analysis of coiled-coil structures

    Methods Enzymol.

    (1996)
  • J Martin et al.

    Chaperone-assisted protein folding

    Curr. Opin. Struct. Biol.

    (1997)
  • A Mellors et al.

    O-sialoglycoprotease from Pasteurella haemolytica

    Methods Enzymol.

    (1995)
  • A.G Murzin

    Structural classification of proteins: new superfamilies

    Curr. Opin. Struct. Biol.

    (1996)
  • A.G Murzin

    How far divergent evolution goes in proteins

    Curr. Opin. Struct. Biol.

    (1998)
  • A.G Murzin et al.

    SCOP: a structural classification of proteins database for the investigation of sequences and structures

    J. Mol. Biol.

    (1995)
  • M.A Oettinger

    Cutting apart V(D)J recombination

    Curr. Opin. Genet. Dev.

    (1996)
  • J Reizer et al.

    Exopolyphosphate phosphatase and guanosine pentaphosphate phosphatase belong to the sugar kinase/actin/hsp 70 superfamily

    Trends Biochem. Sci.

    (1993)
  • S.G Rhee et al.

    The role of adenylyltransferase and uridylyltransferase in the regulation of glutamine synthetase in Escherichia coli

    Curr. Top. Cell Reg.

    (1985)
  • L Rychlewski et al.

    Fold and function predictions for Mycoplasma genitalium proteins

    Fold. Design

    (1998)
  • S Shuman

    Closing the gap on DNA ligase

    Structure

    (1996)
  • H.S Subramanya et al.

    Crystal structure of an ATP-dependent DNA ligase from bacteriophage T7

    Cell

    (1996)
  • D.C Van Gent et al.

    Initiation of V(D)J recombination in a cell-free system

    Cell

    (1995)
  • T.F Wang et al.

    Golgi localization and functional expression of human uridine diphosphatase

    J. Biol. Chem.

    (1998)
  • J.C Wootton

    Non-globular domains in protein sequences: automated segmentation using complexity measures

    Comput. Chem.

    (1994)
  • J.C Wootton et al.

    Analysis of compositionally biased regions in sequence databases

    Methods Enzymol.

    (1996)
  • K.M Abdullah et al.

    A neutral glycoprotease of Pasteurella haemolytica A1 specifically cleaves O-sialoglycoproteins

    Infect. Immun.

    (1992)
  • A Agrawal et al.

    Transposition mediated by RAG1 and RAG2 and its implications for the evolution of the immune system

    Nature

    (1998)
  • S.F Altschul et al.

    Gapped BLAST and PSI-BLAST: a new generation of protein database search programs

    Nucl. Acids Res.

    (1997)
  • L Aravind et al.

    Phosphoesterase domains associated with DNA polymerases of diverse origins

    Nucl. Acids Res.

    (1998)
  • L Aravind et al.

    DNA polymerase β-like nucleotidyltransferase superfamily: identification of three new families, classification and evolutionary history

    Nucl. Acids Res.

    (1999)
  • L Aravind et al.

    Toprim: a conserved catalytic domain in type IA and II topoisomerases, DnaG-type primases, OLD family nucleases and RecR proteins

    Nucl. Acids Res.

    (1998)
  • F Arigoni et al.

    A genome-based approach for the identification of essential bacterial genes

    Nature Biotechnol.

    (1998)
  • T.A Barton et al.

    Further characterization of Renibacterium salmoninarum extracellular products

    Appl. Environ. Microbiol.

    (1997)
  • S.F Bellon et al.

    Crystal structure of the RAG1 dimerization domain reveals multiple zinc-binding motifs including a novel zinc binuclear cluster

    Nature Struct. Biol.

    (1997)
  • J.M Berger et al.

    Structural similarities between topoisomerases that cleave one or both DNA strands

    Proc. Natl Acad. Sci. USA

    (1998)
  • M Boll et al.

    Benzoyl-coenzyme A reductase (dearomatizing), a key enzyme of anaerobic aromatic metabolism. ATP dependence of the reaction, purification and some properties of the enzyme from Thauera aromatica strain K172

    Eur. J. Biochem.

    (1995)
  • P Bork et al.

    Predicting functions from protein sequences-where are the bottlenecks?

    Nature Genet.

    (1998)
  • P Bork et al.

    An ATPase domain common to prokaryotic cell cycle proteins, sugar kinases, actin, and hsp70 heat shock proteins

    Proc. Natl Acad. Sci. USA

    (1992)
  • I Callebaut et al.

    The V(D)J recombination activating protein RAG2 consists of a six-bladed propeller and a PHD fingerlike domain, as revealed by sequence analysis

    Cell Mol. Life Sci.

    (1998)
  • M Cashel et al.

    The stringent response

1 Edited by J. M. Thornton
