Regular article
Protein folding simulations with genetic algorithms and a detailed molecular description1

https://doi.org/10.1006/jmbi.1997.1010Get rights and content

Abstract

We have explored the application of genetic algorithms (GA) to the determination of protein structure from sequence, using a full atom representation. A free energy function with point charge electrostatics and an area based solvation model is used. The method is found to be superior to previously investigated Monte Carlo algorithms. For selected fragments, up to 14 residues long, the lowest free energy structures produced by the GA are similar in conformation to the corresponding experimental structures in most cases. There are three main conclusions from these results. First, the genetic algorithm is an effective method for searching amongst the compact conformations of a polypeptide chain. Second, the free energy function is generally able to select native-like conformations. However, some deficiencies are identified, and further development is proposed. Third, the selection of native-like conformations for some protein fragments establishes that in these cases the conformation observed in the full protein structure is largely context independent. The implications for the nature of protein folding pathways are discussed.

Introduction

In order to determine protein structure from amino acid sequence, two central problems must be overcome: some form of free energy function must be developed that is able to distinguish between the functional conformation and all the rest, and a conformational search method must be devized that can find that functional conformation using available computing resources.

The most realistic method to date for studying protein motion is Cartesian space molecular dynamics, using an explicit description of all the protein and solvent molecule atoms Allen and Tildesley 1989, Brooks et al 1991. Force fields at this level are highly evolved Aqvist et al 1985, Weiner et al 1984, Brooks et al 1983, Hagler 1985, Lee et al 1995, Kang et al 1996, and a number of studies have shown that there are significant free energy minima near to the functional conformation Brunne et al 1995, Kitson et al 1993. It is not known whether, given enough computer time, simulations started from arbitrary points would converge to the functional conformation. The very large computing requirements have so far prevented simulation of the behavior of proteins or even peptides for more than a few nanoseconds, whereas a period of microseconds to milliseconds would be needed to reproduce observed in vitro folding behavior of peptides (Williams et al., 1996) and small proteins (e.g. Kuszewski et al., 1994). The principal limitation on the simulation scale is that the energy surface with a full atom description is very fine grained. That is, the energy changes rapidly as a function of atomic position. This problem becomes more acute as a molecular system becomes more compact.

To overcome these difficulties, two parallel strategies have been pursued. First, the description of the system has been simplified, to smooth the energy surface. Simplification ranges from the use of implicit solvation models (Lazaridis et al., 1995), slightly (Brooks & Head-Gordon, 1991), moderately (Srinivasan & Rose, 1995) or drastically (Sun et al., 1995) reduced side-chain descriptions, approximate main-chain descriptions (Brooks & Head-Gordon, 1991), and lattice models (Kolinski & Skolnick, 1994), extending to one lattice point per two residues (Park & Levitt, 1996). It is clear that there must be a price to pay for these simplifications, but so far there have been few studies Mounge et al 1995, DeBolt and Skolnick 1996, Huang et al 1996 to establish to what degree the different simplifications affect the ability to identify the functional conformation reliably. We have taken a relatively conservative approach. An implicit solvent description is used, together with all oxygen, nitrogen, carbon and sulfur atoms, as well as polar hydrogen atoms. An empirical force field, including a Coulomb law description of electrostatics with partial atomic charges is used. Advantage is taken of potential of mean force methods (Avbelj & Moult, 1995b) to parameterize this and other terms against experimental protein structures. The ability of this force field to identify conformations close to the functional one has been investigated (Braxenthaler et al., 1996).

The second strategy to over come the search problem is to take larger steps in the conformational space. Covalent geometry (i.e. bond lengths and bond angles, and planarity of conjugated atoms sets) is approximately constant for subgroups of protein atoms, so that conformational freedom may be expressed in terms of the values of the dihedral angles around single bonds. Although molecular dynamics in this space allows a larger time step, greater computational cost partly offsets the gains (Rice & Brünger, 1994). The restrictive move size and the cost of calculating derivatives in traditional molecular dynamics is overcome by using a Monte Carlo (MC) or genetic algorithm (GA) procedure. Much larger changes in conformation can be obtained through the MC procedure. Large changes in one or more dihedral angles are introduced, and the resulting conformation accepted if the evaluated free energy decreases, or if the increase is not too large. A number of MC studies of peptide folding have been made Abagyan and Totrov 1994, Avbelj and Moult 1995a. The disadvantage is that, because the space of a protein molecule is densely packed with atoms, many moves will be rejected as energetically unacceptable.

We have previously investigated the effectiveness of a torsion space MC procedure at finding the experimental conformation of fragments of protein molecules (Avbelj & Moult, 1995a). The method was partly successful, in that in a number of cases low free energy, native-like conformations were generated. However, competing lower free energy but less native-like conformations were often encountered, and the lowest free energy structures were not as low as the corresponding minimized experimental structures. Three possible explanations for these limitations are: (1) Since these are fragments of proteins, the preferred conformation may not necessarily be that found in the full context. (2) The free energy function may have a large number of false minima. (3) The search may not have converged to the lowest free energy conformation possible. The fact that the experimental structures consistently had slightly lower values supported the hypothesis that this last factor played a significant role. We therefore sought a more effective way of searching in torsion space in the compact states of a protein molecule.

Genetic Algorithms (GA) offer one way of searching more effectively in crowded spaces. They were first introduced by Holland (1975) to simulate the processes of natural selection at the genetic level, but rapidly gained acceptance as general search methods (Goldberg, 1989). Possible solutions to a search problem are represented by genes, and a fitness function is used to evaluate the fitness of each gene. A population of genes evolves by three mechanisms. (1) Point mutations, in which the value at a specific place on a single gene is changed. (2) Crossovers, in which a portion of one gene replaces the equivalent portion of another one. (3) Survival of the fittest genes from one generation to the next. Tests on two and three-dimensional lattice models of proteins Unger and Moult 1993a, Unger and Moult 1993b have demonstrated that GAs are more effective than MC at finding the global energy minimum. It is still less clear how well the method performs on more detailed descriptions of a protein chain.

Dandekar and Argos 1992, Dandekar and Argos 1994 used genetic algorithms to fold Cα backbone models of proteins in real space off lattice. A simplified bit-string encoding of φ/ψ space, allowing each φ/ψ pair to adapt one of seven possible angle combinations (Rooman et al., 1991) was used. Each side-chain was represented by a sphere of 1.9 Å and the global energy or fitness function consisted of hydrogen bonding, secondary structure preference and a hydrophobic scatter term, each of which were scaled according to heuristic constants. The method generally relied on the correct pre-assignment of secondary structure for success.

Sun et al. (1995) used a method similar to that of Dandekar and Argos 1992, Dandekar and Argos 1994 in that all secondary structure was explicitly introduced. A full backbone representation was used, with one virtual atom per side-chain. A simulation consisted of mutations of five degree torsion changes and crossovers at non-secondary structure positions. This procedure was able to find conformations which had a low root-mean-square (RMS) deviation to the experimental structure in five of the seven test cases.

Sun (1993) has presented the most detailed GA for protein folding to date. A full atom polypeptide backbone together with one atom at the centroid of each side-chain was used. The GA was driven by a dictionary of di,, tri-, tetra-, and penta-peptides. A peptide segment was chosen from the database of peptides by sequence. The search method was applied to two small proteins melletin and pancreatic polypeptide. The conformation of a small cyclical peptide was also reproduced. All the structures were generated with an RMS deviation of 1.7 Å within the final population in the GA simulation. These results were impressive, although the proteins simulated were present in the dictionary which was used to derive the GA. The search method is said to be approximately 100 to 200 times more efficient than MC simulated annealing protocols.

A more detailed review of the use of GAs for protein structure prediction may be found in Pedersen & Moult (1996).

The search method, presented here, is tested on a set of 28 fragments of protein structures, up to 14 residues long. The fragments were selected on the basis of experimental data and energetic criteria indicating a preference to adopt a native-like structure independent of the presence of the rest of the protein. They were identified from a survey of the current literature on the folding of proteins and protein fragments. These fragments have the advantage of being relatively small, and therefore present a more tractable search problem than complete proteins. A second advantage is that generation of a native-like conformation in an objective search provides added information about the context independence of a fragment, and its possible role in early folding.

The structure of the rest of the paper is as follows. In the first section, the force field is briefly introduced. Next, details of the GA implementation are described, together with a study of optimization of the GA parameters. Then the results of determining the conformation of a standard set of protein fragments are presented, followed by an analysis of some of the factors contributing to problem cases. Finally, we discuss the significance of the results for the further development of search methods and force fields, and explore the implications of the results for the mechanisms of protein folding.

Section snippets

Force field

The force field is the same as that used in a previous MC study Avbelj and Moult 1995a, Avbelj and Moult 1995b, Avbelj 1992. These papers should be consulted for full details. A short summary is given here.

All nitrogen, oxygen, carbon and polar hydrogen atoms are represented explicitly. Hydrogen atoms on aliphatic and aromatic carbon atoms are merged with those atoms in the usual united atom approximation. All intramolecular electrostatic contributions are calculated using atomic partial

Dihedral angle library

For each residue type, a set of observed φ, ψ, and χ angles was compiled from a library of 255 non-redundant protein structures (Holm & Sander, 1994; 20/20 set, release of October, 1994). A processed version of the Brookhaven files in this library is available at (Braxenthaler et al., 1996). Any protein containing a fragment to be simulated was removed from the database (i.e. the procedure was “Jack Knifed”). These angle sets were used to bias selected angles towards the experimental

GA parameter optimization

Trial GA simulations were performed to establish the best set of conditions. The following parameters were optimized: length of population generating MC simulation, population size, fraction of conformations transferred directly to the next generation, frequency of ΔGlocal standard deviation recalculation, extent of joint search, and extent of side-chain annealing. All trial simulations were performed on residues 10 to 22 of the bacterial Ribonuclease Barnase (Baudet & Janin, 1991), referred to

Discussion

There are three questions to be addressed. How effective is the genetic algorithm as a search technique for peptides and proteins? How well does the free energy function perform? What are the implications of the results for early events in protein folding?

Acknowledgements

We thank NIST for use of the IBM SP2 parallel computer, and NIST computing staff for supporting us in its use. This work was partly funded by grants DOC/NIST 60NANBD1594 and NIH GM40134.

References (69)

  • A. Horovitz et al.

    Co-operative interactions during protein folding

    J. Mol. Biol.

    (1992)
  • E.S. Huang et al.

    Using a hydrophobic contact potential to evaluate native and near-native folds generated by molecular dynamics simulations

    J. Mol. Biol.

    (1996)
  • T. Lazaridis et al.

    Enthalpic contribution to protein stabilityinsights from atom-based calculations and statistical mechanics

    Advan. Protein Chem.

    (1995)
  • H. Lee et al.

    Accurate crystal molecular dynamics simulations using particle-mesh-EwaldRNA dinucleotides ApU and GpC

    Chem. Phys. Letters

    (1995)
  • A. Matouschek et al.

    The folding of an enzyme: V. H/2H exchange-nuclear magnetic resonance studies on the folding pathway of barnase: complementarity to and agreement with protein engineering studies

    J. Mol. Biol.

    (1992)
  • B. Park et al.

    Energy functions that discriminate X-ray and near native folds from well-constructed decoys

    J. Mol. Biol.

    (1996)
  • J.T. Pedersen et al.

    Genetic algorithms for protein structure prediction

    Curr. Opin. Struct. Biol.

    (1996)
  • M. Rooman et al.

    Prediction of protein backbone conformation based on seven structural assignments

    J. Mol. Biol.

    (1991)
  • J. Sancho et al.

    An N-terminal fragment of Barnase has residual helical structure similar to that of a refolding intermediate

    J. Mol. Biol.

    (1992)
  • H.A. Scheraga

    Recent developments in the theory of protein foldingsearching for the global energy minimum

    Biophys. Chem.

    (1996)
  • A. Shrake et al.

    Environment and exposure to solvent of protein atoms

    J. Mol. Biol.

    (1973)
  • R. Unger et al.

    Effect of mutations on the performance of genetic algorithms suitable for protein folding simulations, in computer aided innovation of new materials

    Comput.-Aided Innovat. New Mater.

    (1993)
  • R. Unger et al.

    Genetic algorithms for protein folding simulations

    J. Mol. Biol.

    (1993)
  • M.P. Allen et al.

    Computer Simulation of Liquids

    (1989)
  • D.E. Anderson et al.

    pH-induced denaturation of proteinsa salt bridge contributes 3-5 kcal/mol to the free energy of folding of T4 lysozyme

    Biochemistry

    (1990)
  • F. Avbelj

    Use of a potential of mean force to analyse free energy contributions in protein folding

    Biochemistry

    (1992)
  • F. Avbelj et al.

    Determination of the conformations of folding initiation sites in proteins by computer simulation

    Proteins: Struct. Funct. Genet.

    (1995)
  • F. Avbelj et al.

    The role of electrostatic screening in determining protein main chain conformational preferences

    Biochemistry

    (1995)
  • F.J. Blanco et al.

    Folding of protein G B1 domain studied by the conformational characterization of fragments comprising its secondary structure elements

    Eur. J. Biochem.

    (1995)
  • D.T. Brandt et al.

    The role of dipole interactions in determining polypeptide conformation

    J. Am. Chem. Soc.

    (1965)
  • M. Braxenthaler et al.

    Carb biocomputing web pages, force field and parameters

    (1996)
  • B. Brooks et al.

    CHARMMa program for macromolecular energy, minimization, and dynamics calculations

    J. Comput. Chem.

    (1983)
  • C.L.I. Brooks et al.

    Virtual rigid body dynamics

    Biopolymers

    (1991)
  • C.L.I. Brooks et al.

    G. Nemethy on proteinsa theoretical perspective of dynamics structure and thermodynamics

    Advan. Chem. Phys. Bull. Mathemat. Biol.

    (1991)
  • Cited by (117)

    • Soft computing methods for the prediction of protein tertiary structures: A survey

      2015, Applied Soft Computing Journal
      Citation Excerpt :

      For the representation, authors use some amino acid features,e.g., partial charge and van der Waals bond. In [82], a global free energy function of an unfolded conformation is calculated as the sum of threes types of energy: local backbone electrostatic energy, which represents the sum of the interactions between NH and CO groups among amino acids, intra-molecular electrostatic energy, that reflects the rest of intra-molecular interactions, and solvation free energy, that takes into account different interactions as hydrogen bonding, ion–dipole, and dipole–dipole attractions or van der Waals forces. Finally, in [127], author develops a tabu search algorithm for the torsion angles prediction.

    • Three-dimensional protein structure prediction: Methods and computational strategies

      2014, Computational Biology and Chemistry
      Citation Excerpt :

      Eq. (1) presents the CHARMM force field (MacKerell et al., 1998). Energy surface search techniques: methods for Ab initio prediction include Molecular Dynamics simulations of proteins and protein-substrate complexes (van Gunsteren and Berendsen, 1990; Rapaport, 2004; Koza, 1992); Monte Carlo simulations that do not use forces but rather compare energies, via the use of Boltzmann probabilities (Simons et al., 1997); Genetic Algorithms which are based on populations of solutions by iterative cycles of operations (Holland, 1993; Pokala and Handel, 2000) and try to improve on the sampling and the convergence of Monte Carlo approaches (Pedersen and Moult, 1997; Tuffery et al., 1991; Bowie and Eisenberg, 1994), and exhaustive and semi-exhaustive lattice-based studies which are based on using a crude/approximate fold representation (such as two residues per lattice point) and then exploring all or large amounts of the conformational space given the crude representation. There are many computational packages used in Ab initio protein structure simulations.

    View all citing articles on Scopus
    1

    Edited by F. E. Cohen

    View full text