Journal of Molecular Biology
Regular articleA graph-theoretic algorithm for comparative modeling of protein structure1
Introduction
The rapidly increasing number of known protein structures has resulted in a situation where approximate structures corresponding to new sequences are often available from one of two sources. First, when the sequence of interest is clearly related to those of one or more known structures, then the overall folds are the same (Chothia & Lesk, 1986). This is the case now for about 30% of the general sequences entering the databases (Schneider & Sander, 1996), and about 10% of genome sequences (Scharf et al., 1994). Second, even when there is no detectable sequence relationship, the techniques of threading make it possible to recognize the fold in many cases (Levitt, 1997). Extrapolating this trend, it appears that the routine generation of approximate models of protein structure from sequence may soon become a reality.
To be of much practical use, these approximate structures need to be refined into detailed accurate models. The technique for doing this is usually termed comparative or homology modeling. In contrast to progress in generating approximate structures, this process has turned out to be more difficult. Objective testing of the methods Mosimann et al 1995, Martin et al 1997 shows that the current numerical techniques offer little advantage over simply copying large parts of the related structures. There are two fundamental difficulties to be overcome. First, methods must deal with the compact states of the polypeptide chain, where steric exclusion effects makes the energy surface extremely discontinuous, so that search methods that make semi-random moves such as Monte Carlo Chen 1989, Skolnick and Kolinski 1990, Wilson and Cui 1990, Okamoto et al 1991, Abagyan and Totrov 1994, Avbelj and Moult 1995 and genetic algorithms Sun 1993, Unger and Moult 1993, Pedersen and Moult 1997 have difficulty finding acceptable conformations, while continuous search methods such as molecular dynamics appear to get stuck in local parts of the space (Venclovas et al., 1997). Second, in many instances the details of the conformation are highly context-dependent: for example, the conformation adopted by a particular stretch of polypeptide chain can be different, depending on the environment it is in (Samudrala et al., 1995). Thus, to be effective, an algorithm must be able to cope with the discontinuous nature of the search space, and to take into account an extensive web of interactions.
In computing science, the notion of a “graph” has been used to describe many systems that are made up of such interconnected networks (Harel, 1992). These include laying out the shortest combination of railroad segments between a network of cities (finding minimal spanning trees), finding the shortest paths between any two cities in a network of cities, and finding the shortest path in a city network, passing through all the cities exactly once (the famous Travelling Salesman problem). In computational chemistry and biology, graph-theoretic approaches have been used to enumerate chemical isomers (National Research Council, 1995) and for protein structure comparison Grindley et al 1993, Artymiuk et al 1995.
Our goal is to find the best set of interactions in a protein structure, given a variety of side-chain and main-chain conformational choices for each position in the structure. We present an algorithm based on graph theory that will find the optimal arrangement of all these choices, as measured by some discriminatory function, while adequately considering the context-sensitivity seen in protein structures.
Specifically, we present conformations of parts of protein molecules, usually single amino acid residues, as nodes in a graph. Edges are drawn between self-consistent sub-conformations and nodes and edges are weighted with some fitness function. The maximal completely connected graphs (cliques) then represent possible conformations of the molecule, and those with the best weight are assumed to be the most native-like. In principle, the method can be applied to any structure prediction problem. In practice, computational limitations on the number of combinations of conformations that can be considered make it most suitable for comparative modeling applications.
There are three principle advantages to the graph-theoretic representation: (1) it provides a simple framework in which to consider the combinations of possible sub-conformations systematically, avoiding the need for following a trajectory through the rough energy landscape. (2) It provides control over which sub-conformations to include, allowing resources to be focused on the more uncertain aspects of the structure. (3) Pre-calculation of the fitness of each node, and of the interaction between pairs of nodes, greatly reduces the computational cost of evaluating a conformation, allowing many more combinations to be considered compared to a conventional fitness calculation.
Comparative modeling can be regarded as a series of steps. Generally, an alignment between the sequence to be modeled (the target) and a related sequence with known structure (the parent of the template) is first constructed Greer 1990, Mosimann et al 1995. An initial partial model is then built by copying the main-chain coordinates from the parent structure(s) for equivalent residues in the alignment. Some side-chain conformations may be inferred from the parent structures. Remaining parts of the structure (insertions in the target sequence relative to the parent, rejoining of the chain around deletions, regions of chain with low levels of sequence homology between the target and parent, regions where an alternative parent structure may result in a more accurate model, and other side-chain conformations) must then be built.
We describe how these building steps can be accomplished with the graph-theoretic clique-finding method. We summarize the results from the second experiment on the Critical Assessment of protein Structure Prediction methods (CASP2), where the method was applied to build comparative models of sequences for which the experimental structures were not then known (Samudrala & Moult, 1997).
Section snippets
General description
Each possible conformation of a residue in an amino acid sequence is represented as a node in a graph. Edges are then drawn between pairs of residues/nodes that are consistent with each other. Edges and nodes are weighted according to some fixed criteria. Once the entire graph is constructed, all the maximal sets of completely connected nodes (cliques) are found using a clique-finding (CF) algorithm. The cliques with the best weight are considered to be similar to the native structure. Figure 1
Building side-chains in a comparative modeling scenario
We illustrate how the clique finding method performs for building side-chains using a comparative target and corresponding model from the first experiment on the Critical Assessment of protein Structure Prediction methods (CASP1). The target is the histidine-containing phosphocarrier protein (hpr) from Mycoplasma capricolum, an 89 residue protein (Pieper et al., 1995). In the model of this structure we built for CASP1, 27 of 67 χ1 angles deviated more than 30° from the experimental structure.
General effectiveness of the algorithm
The graph-theoretic clique-finding method is a member of the class of algorithms for protein structure prediction that are based on a partial enumeration and evaluation of the possible structures. There are three interlocked components to such a procedure: adequate sampling of the conformations of substructures, filtering out combinations of substructures that are clearly non-viable, and scoring complete structures for fitness. Performance is affected by all these factors: a poor quality
Acknowledgements
We thank Jan Pedersen for help in using the AbM database method and for constructive advice. This work was supported in part by a Life Technologies Fellowship to R.S. and by NIH grant GM41034 to J.M. Some computations were performed using NIST computing resources.
References (38)
- et al.
Biased probability Monte Carlo conformational searches and electrostatic calculations for peptides and proteins
J. Mol. Biol
(1994) - et al.
Identification of tertiary structure resemblance in proteins using a maximal common subgraph isomorphism algorithm
J. Mol. Biol
(1993) - et al.
Evaluation of protein models by atomic solvation preference
J. Mol. Biol
(1992) - et al.
Refined structures of the active ser83-cys and impaired ser46-asp histidine-containing phosphocarrier proteins
Structure
(1994) - et al.
Folding simulation with genetic algorithms and a detailed molecular description
J. Mol. Biol
(1997) - et al.
Antibody modelingbeyond homology
Immunomethods
(1992) - et al.
Structural evidence for the evolutionary divergence of mycoplasma from Gram-positive bacteriathe histidine-containing phosphocarrier protein
Structure
(1995) - et al.
Genetic algorithms for protein folding simulations
J. Mol. Biol
(1993) - et al.
Comparison of protein folds and sidechain clusters using algorithms from graph theory
Proceedings of the CCP4 Study Weekend
(1995) - et al.
Determination of the conformation of folding initiation sites in proteins by computer simulation
Proteins: Struct. Funct. Genet
(1995)
Bound water molecules and conformational stabilisation help mediate an antigen-antibody association
Proc. Natl Acad. Sci. USA
Method to identify protein sequences that fold into a known three-dimensional structure
Science
Algorithm 457finding all cliques of an undirected graph
Commun. ACM
Monte Carlo simulations for the study of hemoglobin-fragment conformations
J. Comput. Chem
The relation between the divergence of sequence and structure in proteins
EMBO J
The use of side-chain packing methods in modeling bacteriophage repressor and cro proteins
Protein Sci
Comparison of systematic search and database methods for constructing segments of protein structure
Protein Eng
Comparative modeling methods: application to the family of the mammalian serine proteases
Proteins: Struct. Funct. Genet
Algorithmics
Cited by (126)
An exact algorithm for the multi-trip container drayage problem with truck platooning
2023, Transportation Research Part E: Logistics and Transportation ReviewMaximal independent sets in clique-free graphs
2022, European Journal of CombinatoricsSliding into the Future: Investigating Sliding Windows in Temporal Graphs
2023, Leibniz International Proceedings in Informatics, LIPIcsParallel Colorful h-star Core Maintenance in Dynamic Graphs
2023, Proceedings of the VLDB EndowmentApplications of Maximum Independent Set
2022, AIP Conference Proceedings
- 1
Edited by F. Cohen