Protein folding simulations with genetic algorithms and a detailed molecular description

doi:10.1006/jmbi.1997.1010

Journal of Molecular Biology

Volume 269, Issue 2, 6 June 1997, Pages 240-259

https://doi.org/10.1006/jmbi.1997.1010 Get rights and content

Abstract

We have explored the application of genetic algorithms (GA) to the determination of protein structure from sequence, using a full atom representation. A free energy function with point charge electrostatics and an area based solvation model is used. The method is found to be superior to previously investigated Monte Carlo algorithms. For selected fragments, up to 14 residues long, the lowest free energy structures produced by the GA are similar in conformation to the corresponding experimental structures in most cases. There are three main conclusions from these results. First, the genetic algorithm is an effective method for searching amongst the compact conformations of a polypeptide chain. Second, the free energy function is generally able to select native-like conformations. However, some deficiencies are identified, and further development is proposed. Third, the selection of native-like conformations for some protein fragments establishes that in these cases the conformation observed in the full protein structure is largely context independent. The implications for the nature of protein folding pathways are discussed.

Introduction

In order to determine protein structure from amino acid sequence, two central problems must be overcome: some form of free energy function must be developed that is able to distinguish between the functional conformation and all the rest, and a conformational search method must be devized that can find that functional conformation using available computing resources.

The most realistic method to date for studying protein motion is Cartesian space molecular dynamics, using an explicit description of all the protein and solvent molecule atoms Allen and Tildesley 1989, Brooks et al 1991. Force fields at this level are highly evolved Aqvist et al 1985, Weiner et al 1984, Brooks et al 1983, Hagler 1985, Lee et al 1995, Kang et al 1996, and a number of studies have shown that there are significant free energy minima near to the functional conformation Brunne et al 1995, Kitson et al 1993. It is not known whether, given enough computer time, simulations started from arbitrary points would converge to the functional conformation. The very large computing requirements have so far prevented simulation of the behavior of proteins or even peptides for more than a few nanoseconds, whereas a period of microseconds to milliseconds would be needed to reproduce observed in vitro folding behavior of peptides (Williams et al., 1996) and small proteins (e.g. Kuszewski et al., 1994). The principal limitation on the simulation scale is that the energy surface with a full atom description is very fine grained. That is, the energy changes rapidly as a function of atomic position. This problem becomes more acute as a molecular system becomes more compact.

To overcome these difficulties, two parallel strategies have been pursued. First, the description of the system has been simplified, to smooth the energy surface. Simplification ranges from the use of implicit solvation models (Lazaridis et al., 1995), slightly (Brooks & Head-Gordon, 1991), moderately (Srinivasan & Rose, 1995) or drastically (Sun et al., 1995) reduced side-chain descriptions, approximate main-chain descriptions (Brooks & Head-Gordon, 1991), and lattice models (Kolinski & Skolnick, 1994), extending to one lattice point per two residues (Park & Levitt, 1996). It is clear that there must be a price to pay for these simplifications, but so far there have been few studies Mounge et al 1995, DeBolt and Skolnick 1996, Huang et al 1996 to establish to what degree the different simplifications affect the ability to identify the functional conformation reliably. We have taken a relatively conservative approach. An implicit solvent description is used, together with all oxygen, nitrogen, carbon and sulfur atoms, as well as polar hydrogen atoms. An empirical force field, including a Coulomb law description of electrostatics with partial atomic charges is used. Advantage is taken of potential of mean force methods (Avbelj & Moult, 1995b) to parameterize this and other terms against experimental protein structures. The ability of this force field to identify conformations close to the functional one has been investigated (Braxenthaler et al., 1996).

The second strategy to over come the search problem is to take larger steps in the conformational space. Covalent geometry (i.e. bond lengths and bond angles, and planarity of conjugated atoms sets) is approximately constant for subgroups of protein atoms, so that conformational freedom may be expressed in terms of the values of the dihedral angles around single bonds. Although molecular dynamics in this space allows a larger time step, greater computational cost partly offsets the gains (Rice & Brünger, 1994). The restrictive move size and the cost of calculating derivatives in traditional molecular dynamics is overcome by using a Monte Carlo (MC) or genetic algorithm (GA) procedure. Much larger changes in conformation can be obtained through the MC procedure. Large changes in one or more dihedral angles are introduced, and the resulting conformation accepted if the evaluated free energy decreases, or if the increase is not too large. A number of MC studies of peptide folding have been made Abagyan and Totrov 1994, Avbelj and Moult 1995a. The disadvantage is that, because the space of a protein molecule is densely packed with atoms, many moves will be rejected as energetically unacceptable.

We have previously investigated the effectiveness of a torsion space MC procedure at finding the experimental conformation of fragments of protein molecules (Avbelj & Moult, 1995a). The method was partly successful, in that in a number of cases low free energy, native-like conformations were generated. However, competing lower free energy but less native-like conformations were often encountered, and the lowest free energy structures were not as low as the corresponding minimized experimental structures. Three possible explanations for these limitations are: (1) Since these are fragments of proteins, the preferred conformation may not necessarily be that found in the full context. (2) The free energy function may have a large number of false minima. (3) The search may not have converged to the lowest free energy conformation possible. The fact that the experimental structures consistently had slightly lower values supported the hypothesis that this last factor played a significant role. We therefore sought a more effective way of searching in torsion space in the compact states of a protein molecule.

Genetic Algorithms (GA) offer one way of searching more effectively in crowded spaces. They were first introduced by Holland (1975) to simulate the processes of natural selection at the genetic level, but rapidly gained acceptance as general search methods (Goldberg, 1989). Possible solutions to a search problem are represented by genes, and a fitness function is used to evaluate the fitness of each gene. A population of genes evolves by three mechanisms. (1) Point mutations, in which the value at a specific place on a single gene is changed. (2) Crossovers, in which a portion of one gene replaces the equivalent portion of another one. (3) Survival of the fittest genes from one generation to the next. Tests on two and three-dimensional lattice models of proteins Unger and Moult 1993a, Unger and Moult 1993b have demonstrated that GAs are more effective than MC at finding the global energy minimum. It is still less clear how well the method performs on more detailed descriptions of a protein chain.

Dandekar and Argos 1992, Dandekar and Argos 1994 used genetic algorithms to fold C^α backbone models of proteins in real space off lattice. A simplified bit-string encoding of φ/ψ space, allowing each φ/ψ pair to adapt one of seven possible angle combinations (Rooman et al., 1991) was used. Each side-chain was represented by a sphere of 1.9 Å and the global energy or fitness function consisted of hydrogen bonding, secondary structure preference and a hydrophobic scatter term, each of which were scaled according to heuristic constants. The method generally relied on the correct pre-assignment of secondary structure for success.

Sun et al. (1995) used a method similar to that of Dandekar and Argos 1992, Dandekar and Argos 1994 in that all secondary structure was explicitly introduced. A full backbone representation was used, with one virtual atom per side-chain. A simulation consisted of mutations of five degree torsion changes and crossovers at non-secondary structure positions. This procedure was able to find conformations which had a low root-mean-square (RMS) deviation to the experimental structure in five of the seven test cases.

Sun (1993) has presented the most detailed GA for protein folding to date. A full atom polypeptide backbone together with one atom at the centroid of each side-chain was used. The GA was driven by a dictionary of di,, tri-, tetra-, and penta-peptides. A peptide segment was chosen from the database of peptides by sequence. The search method was applied to two small proteins melletin and pancreatic polypeptide. The conformation of a small cyclical peptide was also reproduced. All the structures were generated with an RMS deviation of 1.7 Å within the final population in the GA simulation. These results were impressive, although the proteins simulated were present in the dictionary which was used to derive the GA. The search method is said to be approximately 100 to 200 times more efficient than MC simulated annealing protocols.

A more detailed review of the use of GAs for protein structure prediction may be found in Pedersen & Moult (1996).

The search method, presented here, is tested on a set of 28 fragments of protein structures, up to 14 residues long. The fragments were selected on the basis of experimental data and energetic criteria indicating a preference to adopt a native-like structure independent of the presence of the rest of the protein. They were identified from a survey of the current literature on the folding of proteins and protein fragments. These fragments have the advantage of being relatively small, and therefore present a more tractable search problem than complete proteins. A second advantage is that generation of a native-like conformation in an objective search provides added information about the context independence of a fragment, and its possible role in early folding.

The structure of the rest of the paper is as follows. In the first section, the force field is briefly introduced. Next, details of the GA implementation are described, together with a study of optimization of the GA parameters. Then the results of determining the conformation of a standard set of protein fragments are presented, followed by an analysis of some of the factors contributing to problem cases. Finally, we discuss the significance of the results for the further development of search methods and force fields, and explore the implications of the results for the mechanisms of protein folding.

Section snippets

Force field

The force field is the same as that used in a previous MC study Avbelj and Moult 1995a, Avbelj and Moult 1995b, Avbelj 1992. These papers should be consulted for full details. A short summary is given here.

All nitrogen, oxygen, carbon and polar hydrogen atoms are represented explicitly. Hydrogen atoms on aliphatic and aromatic carbon atoms are merged with those atoms in the usual united atom approximation. All intramolecular electrostatic contributions are calculated using atomic partial

Dihedral angle library

For each residue type, a set of observed φ, ψ, and χ angles was compiled from a library of 255 non-redundant protein structures (Holm & Sander, 1994; 20/20 set, release of October, 1994). A processed version of the Brookhaven files in this library is available at (Braxenthaler et al., 1996). Any protein containing a fragment to be simulated was removed from the database (i.e. the procedure was “Jack Knifed”). These angle sets were used to bias selected angles towards the experimental

GA parameter optimization

Trial GA simulations were performed to establish the best set of conditions. The following parameters were optimized: length of population generating MC simulation, population size, fraction of conformations transferred directly to the next generation, frequency of ΔG_local standard deviation recalculation, extent of joint search, and extent of side-chain annealing. All trial simulations were performed on residues 10 to 22 of the bacterial Ribonuclease Barnase (Baudet & Janin, 1991), referred to

Discussion

There are three questions to be addressed. How effective is the genetic algorithm as a search technique for peptides and proteins? How well does the free energy function perform? What are the implications of the results for early events in protein folding?

Acknowledgements

We thank NIST for use of the IBM SP2 parallel computer, and NIST computing staff for supporting us in its use. This work was partly funded by grants DOC/NIST 60NANBD1594 and NIH GM40134.

References (69)

R. Abagyan et al.
Biased probability Monte Carlo conformational searches and electrostatic calculations for peptides and proteins
J. Mol. Biol.
(1994)
D.O.V. Alonso et al.
Molecular dynamics simulations of protein unfolding and limited refoldingcharacterization of partially unfolded states of ubiquitin in 60methanol and in water
J. Mol. Biol.
(1995)
J. Åqvist et al.
A molecular dynamics study of the C-terminal fragment of the l7/l12 ribosomal proteinsecondary structure motion in a 150 picosecond trajectory
J. Mol. Biol.
(1985)
S. Baudet et al.
Crystal structure of a barnase complex at 1.9 Å resolution
J. Mol. Biol.
(1991)
F. Bernstein et al.
The protein databanka computer based archival file for macromolecular structures
J. Mol. Biol.
(1977)
T. Dandekar et al.
Folding the main-chain of small proteins with the genetic algorithm
J. Mol. Biol.
(1994)
H.J. Dyson et al.
Folding of peptide fragments comprising the complete sequence of proteins. Models for initiation of protein folding. I. Myohermerythrin
J. Mol. Biol.
(1992)
H.J. Dyson et al.
Folding of peptide fragments comprising the complete sequence of proteins. Models for initiation of protein folding. I. Plastocyanin
J. Mol. Biol.
(1992)
J. Garnier et al.
Analysis of the accuracy and implications of simple models for predicting the secondary structure of globular proteins
J. Mol. Biol.
(1978)
A.T. Hagler

A. Horovitz et al.

Co-operative interactions during protein folding

J. Mol. Biol.

(1992)

E.S. Huang et al.

Using a hydrophobic contact potential to evaluate native and near-native folds generated by molecular dynamics simulations

J. Mol. Biol.

(1996)

T. Lazaridis et al.

Enthalpic contribution to protein stabilityinsights from atom-based calculations and statistical mechanics

Advan. Protein Chem.

(1995)

H. Lee et al.

Accurate crystal molecular dynamics simulations using particle-mesh-EwaldRNA dinucleotides ApU and GpC

Chem. Phys. Letters

(1995)

A. Matouschek et al.

The folding of an enzyme: V. H/²H exchange-nuclear magnetic resonance studies on the folding pathway of barnase: complementarity to and agreement with protein engineering studies

J. Mol. Biol.

(1992)

B. Park et al.

Energy functions that discriminate X-ray and near native folds from well-constructed decoys

J. Mol. Biol.

(1996)

J.T. Pedersen et al.

Genetic algorithms for protein structure prediction

Curr. Opin. Struct. Biol.

(1996)

M. Rooman et al.

Prediction of protein backbone conformation based on seven structural assignments

J. Mol. Biol.

(1991)

J. Sancho et al.

An N-terminal fragment of Barnase has residual helical structure similar to that of a refolding intermediate

J. Mol. Biol.

(1992)

H.A. Scheraga

Recent developments in the theory of protein foldingsearching for the global energy minimum

Biophys. Chem.

(1996)

A. Shrake et al.

Environment and exposure to solvent of protein atoms

J. Mol. Biol.

(1973)

R. Unger et al.

Effect of mutations on the performance of genetic algorithms suitable for protein folding simulations, in computer aided innovation of new materials

Comput.-Aided Innovat. New Mater.

(1993)

R. Unger et al.

Genetic algorithms for protein folding simulations

J. Mol. Biol.

(1993)

M.P. Allen et al.

Computer Simulation of Liquids

(1989)

D.E. Anderson et al.

pH-induced denaturation of proteinsa salt bridge contributes 3-5 kcal/mol to the free energy of folding of T4 lysozyme

Biochemistry

(1990)

F. Avbelj

Use of a potential of mean force to analyse free energy contributions in protein folding

Biochemistry

(1992)

F. Avbelj et al.

Determination of the conformations of folding initiation sites in proteins by computer simulation

Proteins: Struct. Funct. Genet.

(1995)

F. Avbelj et al.

The role of electrostatic screening in determining protein main chain conformational preferences

Biochemistry

(1995)

F.J. Blanco et al.

Folding of protein G B1 domain studied by the conformational characterization of fragments comprising its secondary structure elements

Eur. J. Biochem.

(1995)

D.T. Brandt et al.

The role of dipole interactions in determining polypeptide conformation

J. Am. Chem. Soc.

(1965)

M. Braxenthaler et al.

Carb biocomputing web pages, force field and parameters

(1996)

B. Brooks et al.

CHARMMa program for macromolecular energy, minimization, and dynamics calculations

J. Comput. Chem.

(1983)

C.L.I. Brooks et al.

Virtual rigid body dynamics

Biopolymers

(1991)

C.L.I. Brooks et al.

G. Nemethy on proteinsa theoretical perspective of dynamics structure and thermodynamics

Advan. Chem. Phys. Bull. Mathemat. Biol.

(1991)

Cited by (117)

Stochastic numerical approach for solving second order nonlinear singular functional differential equation
2019, Applied Mathematics and Computation
A new computational intelligence numerical scheme is presented for the solution of second order nonlinear singular functional differential equations (FDEs) using artificial neural networks (ANNs), global operator genetic algorithms (GAs), efficient local operator interior-point algorithm (IPA), and the hybrid combination of GA-IPA. An unsupervised error function is assembled for the DDE optimized by ANNs using the hybrid combination of GA-IPA. Three kinds of the second order nonlinear singular DDEs have been solved numerically and compared their results with the exact solutions to authenticate the performance and exactness of the present designed scheme. Moreover, statistical analysis based on Mean absolute deviation, Theil's inequality coefficient and Nash Sutcliffe efficiency is also performed to validate the convergence and accuracy of the present scheme.
APL: An angle probability list to improve knowledge-based metaheuristics for the three-dimensional protein structure prediction
2015, Computational Biology and Chemistry
Tertiary protein structure prediction is one of the most challenging problems in structural bioinformatics. Despite the advances in algorithm development and computational strategies, predicting the folded structure of a protein only from its amino acid sequence remains as an unsolved problem. We present a new computational approach to predict the native-like three-dimensional structure of proteins. Conformational preferences of amino acid residues and secondary structure information were obtained from protein templates stored in the Protein Data Bank and represented as an Angle Probability List. Two knowledge-based prediction methods based on Genetic Algorithms and Particle Swarm Optimization were developed using this information. The proposed method has been tested with twenty-six case studies selected to validate our approach with different classes of proteins and folding patterns. Stereochemical and structural analysis were performed for each predicted three-dimensional structure. Results achieved suggest that the Angle Probability List can improve the effectiveness of metaheuristics used to predicted the three-dimensional structure of protein molecules by reducing its conformational search space.
Soft computing methods for the prediction of protein tertiary structures: A survey
2015, Applied Soft Computing Journal
Citation Excerpt :
For the representation, authors use some amino acid features,e.g., partial charge and van der Waals bond. In [82], a global free energy function of an unfolded conformation is calculated as the sum of threes types of energy: local backbone electrostatic energy, which represents the sum of the interactions between NH and CO groups among amino acids, intra-molecular electrostatic energy, that reflects the rest of intra-molecular interactions, and solvation free energy, that takes into account different interactions as hydrogen bonding, ion–dipole, and dipole–dipole attractions or van der Waals forces. Finally, in [127], author develops a tabu search algorithm for the torsion angles prediction.
The problem of protein structure prediction (PSP) represents one of the most important challenges in computational biology. Determining the three dimensional structure of proteins is necessary to understand their functions at molecular level. The most representative soft computing approaches for solving the protein tertiary structure prediction problem are summarized in this paper. These approaches have been categorized following the type of methodology. A total of 90 relevant works published in last 15 years in the field of protein structure prediction have been reported, including the best competitors in last CASP editions. However, despite large research effort in last decades, a considerable scope for further improvement still remains in this area.
Three-dimensional protein structure prediction: Methods and computational strategies
2014, Computational Biology and Chemistry
Citation Excerpt :
Eq. (1) presents the CHARMM force field (MacKerell et al., 1998). Energy surface search techniques: methods for Ab initio prediction include Molecular Dynamics simulations of proteins and protein-substrate complexes (van Gunsteren and Berendsen, 1990; Rapaport, 2004; Koza, 1992); Monte Carlo simulations that do not use forces but rather compare energies, via the use of Boltzmann probabilities (Simons et al., 1997); Genetic Algorithms which are based on populations of solutions by iterative cycles of operations (Holland, 1993; Pokala and Handel, 2000) and try to improve on the sampling and the convergence of Monte Carlo approaches (Pedersen and Moult, 1997; Tuffery et al., 1991; Bowie and Eisenberg, 1994), and exhaustive and semi-exhaustive lattice-based studies which are based on using a crude/approximate fold representation (such as two residues per lattice point) and then exploring all or large amounts of the conformational space given the crude representation. There are many computational packages used in Ab initio protein structure simulations.
A long standing problem in structural bioinformatics is to determine the three-dimensional (3-D) structure of a protein when only a sequence of amino acid residues is given. Many computational methodologies and algorithms have been proposed as a solution to the 3-D Protein Structure Prediction (3-D-PSP) problem. These methods can be divided in four main classes: (a) first principle methods without database information; (b) first principle methods with database information; (c) fold recognition and threading methods; and (d) comparative modeling methods and sequence alignment strategies. Deterministic computational techniques, optimization techniques, data mining and machine learning approaches are typically used in the construction of computational solutions for the PSP problem. Our main goal with this work is to review the methods and computational strategies that are currently used in 3-D protein prediction.
New set of 2D/3D thermodynamic indices for proteins. A formalism based on "Molten Globule" theory
2010, Physics Procedia
We define eight new macromolecular indices, and several related descriptors for proteins. The coarse grained methodology used for its deduction ensures its fast execution and becomes a powerful potential tool to explore large databases of protein structures. The indices are intended for stability studies, predicting $Φ$ -values, predicting folding rate constants, protein QSAR/QSPR as well as protein alignment studies. Also, these indices could be used as scoring function in protein-protein docking or 3D protein structure prediction algorithms and any others applications which need a numerical code for proteins and/or residues from 2D or 3D format.
An efficient hybrid Taguchi-genetic algorithm for protein folding simulation
2009, Expert Systems with Applications
Given the amino-acid sequence of a protein, the prediction of a protein’s tertiary structure is known as the protein folding problem. The protein folding problem in the hydrophobic–hydrophilic lattice model is to find the lowest energy conformation. In order to enhance the performance of predicting protein structure, in this paper we propose an efficient hybrid Taguchi-genetic algorithm that combines genetic algorithm, Taguchi method, and particle swarm optimization (PSO). The GA has the capability of powerful global exploration, while the Taguchi method can exploit the optimum offspring. In addition, we present the PSO inspired by a mutation mechanism in a genetic algorithm. We demonstrate that our algorithm can be applied successfully to the protein folding problem based on the hydrophobic-hydrophilic lattice model. Simulation results indicate that our approach performs very well against existing evolutionary algorithm.

View all citing articles on Scopus

¹: Edited by F. E. Cohen

View full text

Journal of Molecular Biology

Regular articleProtein folding simulations with genetic algorithms and a detailed molecular description1

Abstract

Introduction

Section snippets

Force field

Dihedral angle library

GA parameter optimization

Discussion

Acknowledgements

J. Mol. Biol.

J. Mol. Biol.

J. Mol. Biol.

J. Mol. Biol.

J. Mol. Biol.

J. Mol. Biol.

J. Mol. Biol.

J. Mol. Biol.

J. Mol. Biol.

J. Mol. Biol.

J. Mol. Biol.

Advan. Protein Chem.

Chem. Phys. Letters

J. Mol. Biol.

J. Mol. Biol.

Curr. Opin. Struct. Biol.

J. Mol. Biol.

J. Mol. Biol.

Biophys. Chem.

J. Mol. Biol.

Comput.-Aided Innovat. New Mater.

J. Mol. Biol.

Computer Simulation of Liquids

pH-induced denaturation of proteinsa salt bridge contributes 3-5 kcal/mol to the free energy of folding of T4 lysozyme

Biochemistry

Use of a potential of mean force to analyse free energy contributions in protein folding

Biochemistry

Determination of the conformations of folding initiation sites in proteins by computer simulation

Proteins: Struct. Funct. Genet.

The role of electrostatic screening in determining protein main chain conformational preferences

Biochemistry

Folding of protein G B1 domain studied by the conformational characterization of fragments comprising its secondary structure elements

Eur. J. Biochem.

The role of dipole interactions in determining polypeptide conformation

J. Am. Chem. Soc.

Carb biocomputing web pages, force field and parameters

CHARMMa program for macromolecular energy, minimization, and dynamics calculations

J. Comput. Chem.

Virtual rigid body dynamics

Biopolymers

G. Nemethy on proteinsa theoretical perspective of dynamics structure and thermodynamics

Advan. Chem. Phys. Bull. Mathemat. Biol.

Regular article
Protein folding simulations with genetic algorithms and a detailed molecular description¹