MMTSB Tool Set: enhanced sampling and multiscale modeling methods for applications in structural biology

https://doi.org/10.1016/j.jmgm.2003.12.005

Abstract

We describe the Multiscale Modeling Tools for Structural Biology (MMTSB) Tool Set (http://mmtsb.scripps.edu/software/mmtsbToolSet.html), which is a novel set of utilities and programming libraries that provide new enhanced sampling and multiscale modeling techniques for the simulation of proteins and nucleic acids. The tool set interfaces with the existing molecular modeling packages CHARMM and Amber for classical all-atom simulations, and with MONSSTER for lattice-based low-resolution conformational sampling. In addition, it adds new functionality for the integration and translation between both levels of detail. The replica exchange method is implemented to allow enhanced sampling of both the all-atom and low-resolution models. The tool set aims at applications in structural biology that involve protein or nucleic acid structure prediction, refinement, and/or extended conformational sampling. With structure prediction applications in mind, the tool set also implements a facility that allows the control and application of modeling tasks on a large set of conformations in what we have termed ensemble computing. Ensemble computing encompasses loosely coupled, parallel computation on high-end parallel computers, clustered computational grids and desktop grid environments.

This paper describes the design and implementation of the MMTSB Tool Set and illustrates its utility with three typical examples—scoring of a set of predicted protein conformations in order to identify the most native-like structures, ab initio folding of peptides in implicit solvent with the replica exchange method, and the prediction of a missing fragment in a larger protein structure.

Introduction

The success of computational methodologies in chemistry that have been developed over the last four decades is reflected in a multitude of academic and commercial programs available today. CHARMM [1], Amber [2], and Gaussian [3] are typical examples of this development, and enjoy wide usage in both academia and industry. Most of the programs that have emerged from this period are highly functional, well optimized, and sufficiently integrated within their intended range of applications. However, because of their high level of complexity, proprietary command interfaces, and input/output formats, these programs often tend to be inflexible when extensions and/or interoperability with other existing programs are needed. While this is a common problem in the integration of heterogeneous legacy software components [4], such issues have become especially apparent in the implementation of new enhanced sampling techniques applied to the conformational sampling of biopolymers. These novel simulation protocols combine existing methods in order to improve conformational sampling efficiency for molecular modeling and dynamics applications. Generalized ensemble sampling techniques, for example, involve parallel simulations of a system of interest with different weight factors coupled by a Monte Carlo simulation protocol [5], [6], [7]. Variants of this sampling scheme are being used increasingly in the study of long time scale phenomena such as protein folding [8], [9], [10], [11], [12], [13]. These methods could be implemented in the form of separate, new programs or by modifying existing simulation packages. In a more practical implementation, however, existing programs could be used to run each of the simulations while an external interface layer couples and controls the individual simulations and thereby implements the enhanced sampling methodology.
This approach would allow greater flexibility in using the same enhanced sampling method with different simulation programs, and avoid difficulties in modifying existing large software packages directly.
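The coupling step that such an interface layer would perform can be illustrated with temperature replica exchange, one common form of generalized ensemble sampling. The following is a minimal sketch, not the tool set's implementation: it assumes replicas that differ only in temperature, energies in kcal/mol, and the standard Metropolis exchange criterion. The function names are hypothetical.

```python
import math
import random

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol*K); unit choice is an assumption


def exchange_probability(energy_i, temp_i, energy_j, temp_j):
    """Metropolis probability of swapping the conformations of two replicas.

    Replica i runs at temp_i with potential energy energy_i, replica j at
    temp_j with energy_j. Uses delta = (beta_i - beta_j) * (E_j - E_i).
    """
    beta_i = 1.0 / (K_B * temp_i)
    beta_j = 1.0 / (K_B * temp_j)
    delta = (beta_i - beta_j) * (energy_j - energy_i)
    return 1.0 if delta <= 0.0 else math.exp(-delta)


def attempt_swap(energy_i, temp_i, energy_j, temp_j, rng=random):
    """Return True if the exchange is accepted for one random draw."""
    return rng.random() < exchange_probability(energy_i, temp_i, energy_j, temp_j)
```

Because only energies and temperatures enter the criterion, the individual simulations can be run by any external program that reports a potential energy, which is the flexibility argued for above.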

Another way to improve conformational sampling is through multiscale modeling techniques. The computational modeling of biological macromolecules commonly revolves around structure representations in atomic or near-atomic detail, with a classical description of physical interactions. Such models have been quite successful in complementing experimental data with structural, dynamic, and energetic information, but involve substantial computational resources for larger systems, or when long time scales have to be considered. In particular, studies of protein folding, structure prediction applications, or the formation and interaction of supramolecular assemblies become prohibitively expensive with models at atomic detail. Alternatively, coarser molecular representations with few virtual particles, often also projected onto lattices, have yielded meaningful results in such cases [14]. Unfortunately, the reduced level of detail often cannot provide the same accuracy as all-atom models. For example, it is quite feasible to generate native topologies from folding simulations using simple lattice models; however, it is much more difficult to actually discern native or near-native conformations from other, non-native conformations that are also generated with the same model [15]. In such cases, one may instead reconstruct all-atom structures from reduced representations [16], and use these more detailed models to regain a higher level of accuracy with an all-atom scoring function that can then distinguish native from non-native conformations [17], [18]. This idea represents the core of more general multiscale modeling approaches; lower resolution models are used to extend sampling to longer time scales or larger system sizes, whereas higher-resolution models provide the energetic accuracy. 
While the structure prediction example above describes a single pass of low-resolution sampling followed by the use of all-atom models for improved accuracy, multiscale modeling can also be done in a continuous fashion, for example through Monte Carlo type simulations that repeatedly move between low- and high-resolution models for extended sampling on an energy landscape that is closely coupled to the interactions of the high-resolution model. The implementation of multiscale modeling methods faces problems similar to those seen in the implementation of enhanced sampling methods, but usually involves the combination of multiple programs rather than a single simulation program. All-atom modeling of biological macromolecules is possible with a number of standard molecular modeling packages such as CHARMM or Amber, but these programs usually do not fully support low-resolution models, especially lattice-based representations. On the other hand, simulation programs for low-resolution models, such as the lattice simulation program MONSSTER [19], do not usually allow all-atom modeling. Both types of applications are fairly complex, so that the option of simply merging them is not very attractive. As for the implementation of enhanced sampling methods, a better solution would be to wrap simulation programs for all-atom and low-resolution models through a common interface layer and provide translation routines between both models as the basis for building multiscale applications.

In this paper, we describe a new set of utilities and programming libraries for the implementation of computationally distributed enhanced and multiscale sampling methods based on existing simulation programs. This package, called Multiscale Modeling Tools for Structural Biology (MMTSB) Tool Set (available at http://mmtsb.scripps.edu/software/mmtsbToolSet.html), is an effort within the NIH Research Resource for Multiscale Modeling Tools for Structural Biology and follows the implementation strategy outlined above by integrating the existing programs through an interface layer while providing missing functionality as necessary. Interpreted scripting languages such as Perl or Python are particularly suitable for building interface layers since they combine ease of use and portability with a high level of functionality for addressing the complex system-oriented but computationally less intensive tasks [20]. Similar scripting-language-based designs have been used successfully in other related applications such as the molecular modeling tool kit (MMTK) [21] or the Bioperl toolkit [22].

The idea of the MMTSB Tool Set is not just to provide a set of user programs for certain enhanced sampling and multiscale modeling tasks, but also a programming workbench, which provides the framework for the development of new applications that require the interplay of multiple simulation packages. It focuses on applications in the area of protein structure prediction, protein folding, and large-scale model building and refinement of proteins and nucleic acids for which enhanced and multiscale sampling techniques are particularly useful. As a subset of its functionalities, the tool set also provides a common user interface to all-atom modeling via CHARMM [1] or Amber [2] and reduced-model lattice modeling via MONSSTER [19]. Furthermore, the tool set incorporates a number of support functions that are motivated by multiscale modeling applications, but are certainly useful for other purposes as well. They include algorithms for translating quickly and accurately between low- and high-resolution models and methods for the organization, manipulation, and evaluation of large sets of conformations for a given protein, in what may be referred to as ensemble computing. Ensemble computing applications greatly benefit from parallel execution since they are inherently parallel in nature and typically require relatively little communication. The tool set provides basic parallel platform support implemented on the scripting language level, which makes it largely platform-independent and does not require specific communication libraries.
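The loosely coupled parallelism of ensemble computing can be sketched as an independent map of one task over many conformations. The example below is an illustration in Python rather than the tool set's Perl implementation. It assumes a hypothetical `score_fn` callable standing in for an external modeling job and uses a thread pool from the standard library.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_ensemble(conformations, score_fn, workers=4):
    """Apply score_fn to every conformation independently.

    No communication between tasks is needed, so the work can be spread
    over any number of workers. Results are returned in input order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(score_fn, conformations))


def toy_score(coords):
    """Hypothetical stand-in for an energy evaluation: sum of squared coordinates."""
    return sum(x * x + y * y + z * z for x, y, z in coords)


ensemble = [[(0.0, 0.0, 1.0)], [(1.0, 1.0, 1.0)], [(2.0, 0.0, 0.0)]]
scores = evaluate_ensemble(ensemble, toy_score, workers=2)
```

In a realistic setting each task would launch a separate simulation process, so the choice between threads and processes matters less than keeping the tasks independent.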

In the following, we will first describe the architecture and components of the MMTSB Tool Set in more detail. We will then provide examples of how the tool set may be used for typical enhanced and multiscale sampling applications in protein structure prediction, structure evaluation, and structure refinement. We conclude by discussing how this architecture may be extended to new tasks and applications.

Section snippets

Architecture

Common modern scripting languages that would be appropriate for building complex applications are Perl and Python. We decided to use Perl as the (still) more widely used scripting language in order to minimize portability issues and to facilitate user extensions as much as possible. As depicted in Fig. 1, the architecture of the MMTSB Tool Set consists of a collection of object-oriented classes, called packages in Perl, that implement all of the core functionalities. These packages are used by

All-atom modeling

The central part of the all-atom modeling components revolves around interfaces to the molecular mechanics packages CHARMM [1] and Amber [2]. In this respect the MMTSB Tool Set may be viewed as an alternative user interface to CHARMM and Amber for certain standard modeling tasks. The tool set utilities are meant to provide access to these powerful programs without requiring the user to go through the learning curve of understanding the specific command and data input and output protocols of

Low-resolution modeling

Low-resolution modeling within the MMTSB Tool Set is based on the MONSSTER program [19]. MONSSTER implements the SICHO (side CHain only) model where each amino acid in a polypeptide chain is represented by a single virtual particle located at the side chain center of mass and projected onto a cubic lattice with 1.45 Å grid spacing [28]. Such a model is particularly well suited for constant temperature or simulated annealing type Monte Carlo simulations based on an energy function that is
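The reduction of a residue to a single lattice site, as described for the SICHO model above, can be sketched as follows. This is a simplified illustration only: it snaps the side chain center of mass to the nearest node of a cubic grid with 1.45 Å spacing, whereas the actual MONSSTER projection, which is not detailed in this excerpt, may apply further constraints such as chain connectivity. All function names are hypothetical.

```python
GRID_SPACING = 1.45  # lattice spacing in Angstrom, as stated in the text


def center_of_mass(atoms):
    """Center of mass of a list of (mass, (x, y, z)) tuples."""
    total = sum(mass for mass, _ in atoms)
    return tuple(sum(mass * xyz[k] for mass, xyz in atoms) / total for k in range(3))


def to_lattice(point, spacing=GRID_SPACING):
    """Integer lattice indices of the grid node nearest to point (assumed rule)."""
    return tuple(round(c / spacing) for c in point)


def from_lattice(index, spacing=GRID_SPACING):
    """Cartesian coordinates of a lattice node."""
    return tuple(i * spacing for i in index)


side_chain = [(12.0, (0.0, 0.0, 0.0)), (12.0, (2.9, 0.0, 0.0))]
site = to_lattice(center_of_mass(side_chain))
```

With nearest-node snapping, each coordinate moves by at most half the grid spacing, which gives a rough sense of the resolution lost in the projection.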

Translation between all-atom and low-resolution models

Both levels of detail, all-atom and low-resolution representations, are brought together by MMTSB functions that allow the generation of lattice chains from all-atom structures and the reconstruction of all-atom structures from lattice chains. Such mapping functions are essential for a multiscale modeling strategy and should preserve initial structures as much as possible through complete translation cycles. The utility for the generation of low-resolution models from all-atom structures is

Ensemble computing

Certain applications such as structure prediction, docking experiments, or estimates of conformational or interaction energies often involve relatively large ensembles of different conformations for a system of interest. Such ensembles may be assembled from simulation snapshots, the endpoints of simulated annealing runs as with the low-resolution lattice model described above, or by other means of conformational sampling. In many cases the ensemble structures are then evaluated and compared in

Replica exchange simulations

An exploration of the potential energy landscape for a system of interest, usually with the goal of finding low-lying regions, is the central theme of most molecular modeling applications. Sampling efficiency with standard simulation techniques such as molecular dynamics or Monte Carlo at a given temperature is governed by the distribution and height of energetic barriers, or ruggedness, and the slope towards the energy minimum in the landscape, both of which determine the kinetic behavior of

Advanced multiscale sampling methods

The utilities for lattice-based low-resolution sampling, for all-atom sampling, and for the translation between low-resolution and all-atom models can be combined to implement a basic multiscale modeling protocol. This is provided with the utility predict.pl, which integrates these steps into a single pass from low-resolution sampling to all-atom based scoring for structure prediction applications. More complex multiscale modeling protocols, however, may involve the continuous transition

Structure analysis functions

A number of utilities in the MMTSB Tool Set can be used for limited structure analysis tasks. They include functions such as clustering or the calculation of root mean square deviations and optimal superposition between two conformations, calculation of the radius of gyration, the fraction of native contacts, or standard peptide chain dihedral angles φ, ψ, ω, and χ1. In ensemble computing applications, most of these structural properties can be calculated in parallel for a whole set of ensemble
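Two of the structural properties named above, the radius of gyration and the fraction of native contacts, can be computed as in the following sketch. It rests on assumptions that are not taken from the tool set: one coordinate per residue, unweighted masses, a 6.5 Å contact cutoff, and a minimum sequence separation of three residues.

```python
import math
from itertools import combinations


def radius_of_gyration(coords):
    """Unweighted radius of gyration of a list of (x, y, z) points."""
    n = len(coords)
    centroid = [sum(c[k] for c in coords) / n for k in range(3)]
    return math.sqrt(sum(math.dist(c, centroid) ** 2 for c in coords) / n)


def contacts(coords, cutoff=6.5, min_separation=3):
    """Set of index pairs (i, j) closer than cutoff and separated in sequence."""
    return {
        (i, j)
        for i, j in combinations(range(len(coords)), 2)
        if j - i >= min_separation and math.dist(coords[i], coords[j]) <= cutoff
    }


def fraction_native_contacts(model, native, cutoff=6.5, min_separation=3):
    """Fraction of the native contacts that are also present in the model."""
    native_set = contacts(native, cutoff, min_separation)
    if not native_set:
        return 0.0
    model_set = contacts(model, cutoff, min_separation)
    return len(native_set & model_set) / len(native_set)
```

Both functions operate on a single conformation, so in an ensemble setting they can be applied to each structure independently.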

Applications

Having provided an overview of the different components of the MMTSB Tool Set, we now present a few typical applications that illustrate the use of the tool set—scoring of previously generated protein conformations with the ensemble computing facility, folding of peptides via replica exchange simulations, and the prediction of a missing fragment in the context of a known structure.

Generation of an ensemble data structure from input files

As the first step, an ensemble is generated from the set of predicted structures by using the checkin.pl utility:

The file names of the predictions submitted to CASP follow the format T0125.pdb for this target and the structures are given an identifying tag casp in the newly created ensemble. Now that the predicted structures are available in ensemble format, ensemble computing tools can be used for further processing.

Preprocessing of input structures

Depending on how the input structures were generated, it is often a good idea to regularize and minimize the structures before calculating energy scores. Many structure predictions do not contain a complete set of atoms. Often, hydrogen atoms are missing and some predictions may consist only of Cα coordinates. Therefore, as the first step we will run the complete.pl utility in order to generate complete, all-atom structures for all of the predictions. Since we want to apply this command to all

Evaluation of scoring function

Finally, we can evaluate a scoring function for the minimized structures. The ensemble computing tool for energy evaluation, enseval.pl, is used as follows:

Here, we are using a scoring function that includes implicit solvation based on a generalized Born formalism [27], in this case the GBMV method [39], [40], as implemented in CHARMM as the default when GB is requested. Again, four CPUs are used in parallel to speed up the calculation. In this case the total energy of the entire molecular

Analysis of results

Once the enseval.pl run is complete the scores are available and can be queried with getprop.pl. A sorted list of all values is obtained easily with the following command:

In this command the property name score and structure tag min are used to identify the data set. Such a result may be sufficient for many applications, but often it is advantageous to form clusters of input structures based on mutual similarity and then compare average scores over cluster members to identify the lowest scoring
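The comparison of average scores over cluster members can be sketched in a few lines. The data layout here is hypothetical (plain dictionaries keyed by structure index) and does not reflect how the tool set stores ensemble properties.

```python
from collections import defaultdict


def rank_clusters(scores, cluster_of):
    """Rank clusters by the mean score of their members, lowest first.

    scores: {structure_id: score}
    cluster_of: {structure_id: cluster_id}
    Returns a list of (cluster_id, mean_score, size) tuples.
    """
    members = defaultdict(list)
    for structure_id, score in scores.items():
        members[cluster_of[structure_id]].append(score)
    ranked = [(cid, sum(vals) / len(vals), len(vals)) for cid, vals in members.items()]
    return sorted(ranked, key=lambda entry: entry[1])


scores = {1: -100.0, 2: -120.0, 3: -90.0, 4: -80.0}
cluster_of = {1: "a", 2: "a", 3: "b", 4: "b"}
ranking = rank_clusters(scores, cluster_of)
```

Averaging over a cluster makes the ranking less sensitive to a single structure with an unusually low score.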

Generation of model conformations from lattice simulations

In the first step, conformations for the missing fragment are generated using lattice-based low-resolution sampling. As input for this step only a sequence file and the template structure are needed. The sequence file contains the entire sequence for the template as well as the missing part. It also provides secondary structure information that is trivially obtained for the template and can be predicted for the missing part with good reliability using a variety of different secondary structure

Selection of protein environment near region of interest

The sampled structures could now be minimized, scored, and analyzed as in the example above. With 200 residues, the complete protein is fairly large, and both the minimization and energy evaluation steps are relatively expensive. Since only a small part of the structure has been varied in the sampling protocol, one does not necessarily need to consider the entire structure. The part of the structure in the vicinity of the variable residues can be cut out according to a distance cutoff similar
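Cutting out the environment of the variable residues by distance can be illustrated as below. This is a sketch under stated assumptions, not the tool set's own procedure: residues are given as a dictionary of atom coordinates, a residue is kept if any of its atoms lies within the cutoff of any atom of a variable residue, and the 10 Å default is a placeholder value.

```python
import math


def select_environment(residues, variable_ids, cutoff=10.0):
    """Return sorted residue ids within cutoff of the variable residues.

    residues: {residue_id: [(x, y, z), ...]} atom coordinates per residue
    variable_ids: residue ids that were varied during sampling (always kept)
    """
    variable_atoms = [atom for rid in variable_ids for atom in residues[rid]]
    selected = set(variable_ids)
    for rid, atoms in residues.items():
        if rid in selected:
            continue
        if any(math.dist(a, v) <= cutoff for a in atoms for v in variable_atoms):
            selected.add(rid)
    return sorted(selected)


residues = {1: [(0.0, 0.0, 0.0)], 2: [(5.0, 0.0, 0.0)], 3: [(30.0, 0.0, 0.0)]}
kept = select_environment(residues, [1], cutoff=10.0)
```

Residues outside the selection are dropped from minimization and scoring, which is where the cost saving comes from.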

Scoring of conformations

Following the example above, the sampled conformations are first minimized before being scored with an energy function that includes implicit solvation based on the generalized Born formalism [39], [40]:

In this case, we create a minimized structure under the tag cutmin. Options specifying restraints to keep the cutout region intact during the minimization as described above are read from an options file generated automatically by enscut.pl when the structures were reduced.

Clustering and analysis

At this point energy scores are available for the sampled conformations and we can proceed to cluster the sampled conformations based on mutual root mean square deviations with the command enscluster.pl:

Since we are primarily interested in the conformation of residues 48 to 55, we will cluster only based on these residues, disregarding the surrounding template. A quick view of the resulting clusters is available with the command showcluster.pl:

In this example, we find 7 clusters with sizes

Summary

We have introduced the MMTSB Tool Set, a collection of utilities and programming libraries aimed at enhanced sampling and multiscale modeling applications in structural biology. The tool set interfaces with the standard molecular modeling packages CHARMM and Amber for all-atom modeling and with MONSSTER for low-resolution lattice-based simulations. It adds a number of functions, such as the translation between all atom and low resolution representations, and implements replica exchange sampling

Acknowledgements

We thank the first users of the MMTSB Tool Set for testing and useful suggestions. Financial support from the NIH supported resource Multiscale Modeling Tools for Structural Biology (http://mmtsb.scripps.edu) (grant RR12255 to CLB III) is acknowledged. Support for the computational infrastructure and personnel (MF) that enabled these calculations came from the DOD through the grant DAMD17-03-2-0012 and is greatly appreciated.

References (56)

  • U.H.E. Hansmann, Parallel tempering algorithm for conformational studies of biological molecules, Chem. Phys. Lett. (1997)
  • M.J. Bower et al., Prediction of protein side-chain rotamers from a backbone-dependent rotamer library: a new homology modeling tool, J. Mol. Biol. (1997)
  • B. Zagrovic et al., β-hairpin folding simulations in atomistic detail using an implicit solvent model, J. Mol. Biol. (2001)
  • X. Daura, Reversible peptide folding in solution by molecular dynamics simulation, J. Mol. Biol. (1998)
  • B.R. Brooks, CHARMM: a program for macromolecular energy, minimization, and dynamics calculations, J. Comput. Chem. (1983)
  • M.J. Frisch, et al., Gaussian 98. Gaussian, Inc, Pittsburgh, PA,...
  • A.C. Siepel, An integration platform for heterogeneous bioinformatics software components, IBM Syst. J. (2001)
  • K.Y. Sanbonmatsu et al., Structure of met-enkephalin in explicit aqueous solution using replica exchange molecular dynamics, Proteins (2002)
  • R. Zhou et al., The free energy landscape for β hairpin folding in explicit water, Proc. Nat. Acad. Sci. U.S.A. (2001)
  • A.E. Garcia et al., Exploring the energy landscape of a β hairpin in explicit solvent, Proteins (2001)
  • D. Gront et al., A new combination of replica exchange Monte Carlo and histogram analysis for protein folding and thermodynamics, J. Chem. Phys. (2001)
  • J. Skolnick et al., A unified approach to the prediction of protein structure and function, Adv. Chem. Phys. (2002)
  • H. Lu et al., A distance-dependent atomic knowledge-based potential for improved protein structure selection, Proteins (2001)
  • M. Feig, Accurate reconstruction of all-atom protein representations from side-chain-based low-resolution models, Proteins (2000)
  • M. Feig, C.L. Brooks, III, Evaluating CASP4 predictions with physical energy functions, Proteins 49 (2002)...
  • C. Simmerling, Combining MONSSTER and LES/PME to predict protein structure from amino acid sequence: application to the small protein CMTI-1, J. Am. Chem. Soc. (2000)
  • D.M. Beazley, An extensible compiler for creating scriptable scientific software, in: Proceedings of Lecture Notes in...
  • K. Hinsen, The molecular modeling tool kit: a new approach to molecular simulations, J. Comput. Chem. (2000)