Regular article
A graph-theoretic algorithm for comparative modeling of protein structure1

https://doi.org/10.1006/jmbi.1998.1689Get rights and content

Abstract

The interconnected nature of interactions in protein structures appears to be the major hurdle in preventing the construction of accurate comparative models. We present an algorithm that uses graph theory to handle this problem. Each possible conformation of a residue in an amino acid sequence is represented using the notion of a node in a graph. Each node is given a weight based on the degree of the interaction between its side-chain atoms and the local main-chain atoms. Edges are then drawn between pairs of residue conformations/nodes that are consistent with each other (i.e. clash-free and satisfying geometrical constraints). The edges are weighted based on the interactions between the atoms of the two nodes. Once the entire graph is constructed, all the maximal sets of completely connected nodes (cliques) are found using a clique-finding algorithm. The cliques with the best weights represent the optimal combinations of the various main-chain and side-chain possibilities, taking the respective environments into account. The algorithm is used in a comparative modeling scenario to build side-chains, regions of main chain, and mix and match between different homologs in a context-sensitive manner. The predictive power of this method is assessed by applying it to cases where the experimental structure is not known in advance.

Introduction

The rapidly increasing number of known protein structures has resulted in a situation where approximate structures corresponding to new sequences are often available from one of two sources. First, when the sequence of interest is clearly related to those of one or more known structures, then the overall folds are the same (Chothia & Lesk, 1986). This is the case now for about 30% of the general sequences entering the databases (Schneider & Sander, 1996), and about 10% of genome sequences (Scharf et al., 1994). Second, even when there is no detectable sequence relationship, the techniques of threading make it possible to recognize the fold in many cases (Levitt, 1997). Extrapolating this trend, it appears that the routine generation of approximate models of protein structure from sequence may soon become a reality.

To be of much practical use, these approximate structures need to be refined into detailed accurate models. The technique for doing this is usually termed comparative or homology modeling. In contrast to progress in generating approximate structures, this process has turned out to be more difficult. Objective testing of the methods Mosimann et al 1995, Martin et al 1997 shows that the current numerical techniques offer little advantage over simply copying large parts of the related structures. There are two fundamental difficulties to be overcome. First, methods must deal with the compact states of the polypeptide chain, where steric exclusion effects makes the energy surface extremely discontinuous, so that search methods that make semi-random moves such as Monte Carlo Chen 1989, Skolnick and Kolinski 1990, Wilson and Cui 1990, Okamoto et al 1991, Abagyan and Totrov 1994, Avbelj and Moult 1995 and genetic algorithms Sun 1993, Unger and Moult 1993, Pedersen and Moult 1997 have difficulty finding acceptable conformations, while continuous search methods such as molecular dynamics appear to get stuck in local parts of the space (Venclovas et al., 1997). Second, in many instances the details of the conformation are highly context-dependent: for example, the conformation adopted by a particular stretch of polypeptide chain can be different, depending on the environment it is in (Samudrala et al., 1995). Thus, to be effective, an algorithm must be able to cope with the discontinuous nature of the search space, and to take into account an extensive web of interactions.

In computing science, the notion of a “graph” has been used to describe many systems that are made up of such interconnected networks (Harel, 1992). These include laying out the shortest combination of railroad segments between a network of cities (finding minimal spanning trees), finding the shortest paths between any two cities in a network of cities, and finding the shortest path in a city network, passing through all the cities exactly once (the famous Travelling Salesman problem). In computational chemistry and biology, graph-theoretic approaches have been used to enumerate chemical isomers (National Research Council, 1995) and for protein structure comparison Grindley et al 1993, Artymiuk et al 1995.

Our goal is to find the best set of interactions in a protein structure, given a variety of side-chain and main-chain conformational choices for each position in the structure. We present an algorithm based on graph theory that will find the optimal arrangement of all these choices, as measured by some discriminatory function, while adequately considering the context-sensitivity seen in protein structures.

Specifically, we present conformations of parts of protein molecules, usually single amino acid residues, as nodes in a graph. Edges are drawn between self-consistent sub-conformations and nodes and edges are weighted with some fitness function. The maximal completely connected graphs (cliques) then represent possible conformations of the molecule, and those with the best weight are assumed to be the most native-like. In principle, the method can be applied to any structure prediction problem. In practice, computational limitations on the number of combinations of conformations that can be considered make it most suitable for comparative modeling applications.

There are three principle advantages to the graph-theoretic representation: (1) it provides a simple framework in which to consider the combinations of possible sub-conformations systematically, avoiding the need for following a trajectory through the rough energy landscape. (2) It provides control over which sub-conformations to include, allowing resources to be focused on the more uncertain aspects of the structure. (3) Pre-calculation of the fitness of each node, and of the interaction between pairs of nodes, greatly reduces the computational cost of evaluating a conformation, allowing many more combinations to be considered compared to a conventional fitness calculation.

Comparative modeling can be regarded as a series of steps. Generally, an alignment between the sequence to be modeled (the target) and a related sequence with known structure (the parent of the template) is first constructed Greer 1990, Mosimann et al 1995. An initial partial model is then built by copying the main-chain coordinates from the parent structure(s) for equivalent residues in the alignment. Some side-chain conformations may be inferred from the parent structures. Remaining parts of the structure (insertions in the target sequence relative to the parent, rejoining of the chain around deletions, regions of chain with low levels of sequence homology between the target and parent, regions where an alternative parent structure may result in a more accurate model, and other side-chain conformations) must then be built.

We describe how these building steps can be accomplished with the graph-theoretic clique-finding method. We summarize the results from the second experiment on the Critical Assessment of protein Structure Prediction methods (CASP2), where the method was applied to build comparative models of sequences for which the experimental structures were not then known (Samudrala & Moult, 1997).

Section snippets

General description

Each possible conformation of a residue in an amino acid sequence is represented as a node in a graph. Edges are then drawn between pairs of residues/nodes that are consistent with each other. Edges and nodes are weighted according to some fixed criteria. Once the entire graph is constructed, all the maximal sets of completely connected nodes (cliques) are found using a clique-finding (CF) algorithm. The cliques with the best weight are considered to be similar to the native structure. Figure 1

Building side-chains in a comparative modeling scenario

We illustrate how the clique finding method performs for building side-chains using a comparative target and corresponding model from the first experiment on the Critical Assessment of protein Structure Prediction methods (CASP1). The target is the histidine-containing phosphocarrier protein (hpr) from Mycoplasma capricolum, an 89 residue protein (Pieper et al., 1995). In the model of this structure we built for CASP1, 27 of 67 χ1 angles deviated more than 30° from the experimental structure.

General effectiveness of the algorithm

The graph-theoretic clique-finding method is a member of the class of algorithms for protein structure prediction that are based on a partial enumeration and evaluation of the possible structures. There are three interlocked components to such a procedure: adequate sampling of the conformations of substructures, filtering out combinations of substructures that are clearly non-viable, and scoring complete structures for fitness. Performance is affected by all these factors: a poor quality

Acknowledgements

We thank Jan Pedersen for help in using the AbM database method and for constructive advice. This work was supported in part by a Life Technologies Fellowship to R.S. and by NIH grant GM41034 to J.M. Some computations were performed using NIST computing resources.

References (38)

  • T. Bhat et al.

    Bound water molecules and conformational stabilisation help mediate an antigen-antibody association

    Proc. Natl Acad. Sci. USA

    (1994)
  • J. Bowie et al.

    Method to identify protein sequences that fold into a known three-dimensional structure

    Science

    (1991)
  • C. Bron et al.

    Algorithm 457finding all cliques of an undirected graph

    Commun. ACM

    (1973)
  • R. Chen

    Monte Carlo simulations for the study of hemoglobin-fragment conformations

    J. Comput. Chem

    (1989)
  • C. Chothia et al.

    The relation between the divergence of sequence and structure in proteins

    EMBO J

    (1986)
  • S. Chung et al.

    The use of side-chain packing methods in modeling bacteriophage repressor and cro proteins

    Protein Sci

    (1995)
  • K. Fidelis et al.

    Comparison of systematic search and database methods for constructing segments of protein structure

    Protein Eng

    (1994)
  • J. Greer

    Comparative modeling methods: application to the family of the mammalian serine proteases

    Proteins: Struct. Funct. Genet

    (1990)
  • D. Harel

    Algorithmics

    (1992)
  • Cited by (126)

    • An exact algorithm for the multi-trip container drayage problem with truck platooning

      2023, Transportation Research Part E: Logistics and Transportation Review
    • Maximal independent sets in clique-free graphs

      2022, European Journal of Combinatorics
    • Sliding into the Future: Investigating Sliding Windows in Temporal Graphs

      2023, Leibniz International Proceedings in Informatics, LIPIcs
    • Applications of Maximum Independent Set

      2022, AIP Conference Proceedings
    View all citing articles on Scopus
    1

    Edited by F. Cohen

    View full text