MMTSB Tool Set: enhanced sampling and multiscale modeling methods for applications in structural biology
Introduction
The success of computational methodologies in chemistry that have been developed over the last four decades is reflected in a multitude of academic and commercial programs available today. CHARMM [1], Amber [2], and Gaussian [3] are typical examples of this development, and enjoy wide usage in both academia and industry. Most of the programs that have emerged from this period are highly functional, well optimized, and sufficiently integrated within their intended range of applications. However, because of a high level of complexity, proprietary command interfaces, and input/output formats, these programs often tend to be inflexible when extensions and/or interoperability with other existing programs are needed. While this is a common problem in the integration of heterogeneous legacy software components [4], such issues have become especially apparent in the implementation of new enhanced sampling techniques applied to the conformational sampling of biopolymers. These novel simulation protocols combine existing methods in order to improve conformational sampling efficiency for molecular modeling and dynamics applications. Generalized ensemble sampling techniques, for example, involve parallel simulations of a system of interest with different weight factors coupled by a Monte Carlo simulation protocol [5], [6], [7]. Variants of this sampling scheme are being used increasingly in the study of long time scale phenomena such as protein folding [8], [9], [10], [11], [12], [13]. These methods could be implemented in the form of separate, new programs or by modifying existing simulation packages. In a more practical implementation, however, existing programs could be used to run each of the simulations while an external interface layer couples and controls the individual simulations and facilitates the enhanced sampling methodology. This approach would allow greater flexibility in using the same enhanced sampling method with different simulation programs, and avoid difficulties in modifying existing large software packages directly.
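As an illustration of the coupling step that such an external layer has to perform, the following is a minimal sketch of the standard Metropolis exchange test used in temperature replica exchange. It is not taken from the MMTSB Tool Set; the function names, the choice of kcal/mol units, and the even/odd pairing scheme are assumptions made for this example. The individual simulations are assumed to be run by an external program that reports one potential energy per temperature level.

```python
import math
import random

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K); unit choice is an assumption


def swap_probability(energy_i, temp_i, energy_j, temp_j):
    """Metropolis acceptance probability for exchanging two replicas.

    energy_i is the potential energy of the configuration currently
    simulated at temp_i, and likewise for j.
    """
    beta_i = 1.0 / (K_B * temp_i)
    beta_j = 1.0 / (K_B * temp_j)
    delta = (beta_i - beta_j) * (energy_i - energy_j)
    return 1.0 if delta >= 0.0 else math.exp(delta)


def attempt_exchanges(energies, temperatures, rng, offset=0):
    """Try swaps between neighbouring temperature levels.

    Pairs (offset, offset+1), (offset+2, offset+3), ... are tested, so the
    pairs are disjoint and can be decided independently. Returns a list in
    which entry k is the index of the configuration that occupies
    temperature level k after the exchange step.
    """
    holder = list(range(len(temperatures)))
    for k in range(offset, len(temperatures) - 1, 2):
        p = swap_probability(energies[k], temperatures[k],
                             energies[k + 1], temperatures[k + 1])
        if rng.random() < p:
            holder[k], holder[k + 1] = holder[k + 1], holder[k]
    return holder


if __name__ == "__main__":
    rng = random.Random(0)
    temps = [300.0, 350.0, 400.0, 450.0]
    # Hypothetical energies reported by four independent simulations.
    energies = [-102.0, -95.0, -97.0, -80.0]
    print(attempt_exchanges(energies, temps, rng, offset=0))
```

In a driver loop, the offset would typically alternate between 0 and 1 from one exchange step to the next so that every neighbouring pair is tested regularly.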
Another way to improve conformational sampling is through multiscale modeling techniques. The computational modeling of biological macromolecules commonly revolves around structure representations in atomic or near-atomic detail, with a classical description of physical interactions. Such models have been quite successful in complementing experimental data with structural, dynamic, and energetic information, but involve substantial computational resources for larger systems, or when long time scales have to be considered. In particular, studies of protein folding, structure prediction applications, or the formation and interaction of supramolecular assemblies become prohibitively expensive with models at atomic detail. Alternatively, coarser molecular representations with few virtual particles, often also projected onto lattices, have yielded meaningful results in such cases [14]. Unfortunately, the reduced level of detail often cannot provide the same accuracy as all-atom models. For example, it is quite feasible to generate native topologies from folding simulations using simple lattice models; however, it is much more difficult to actually discern native or near-native conformations from other, non-native conformations that are also generated with the same model [15]. In such cases, one may instead reconstruct all-atom structures from reduced representations [16], and use these more detailed models to regain a higher level of accuracy with an all-atom scoring function that can then distinguish native from non-native conformations [17], [18]. This idea represents the core of more general multiscale modeling approaches; lower resolution models are used to extend sampling to longer time scales or larger system sizes, whereas higher-resolution models provide the energetic accuracy. 
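To make the idea of a reduced, lattice-projected representation concrete, the sketch below reduces a group of atoms to a single virtual particle at its center of mass and snaps that particle to the nearest node of a cubic lattice. This is a simplified illustration only: the function names are hypothetical, and the default spacing of 1.45 Å is borrowed from the lattice model described later in this paper. An actual lattice chain builder must also satisfy chain connectivity constraints, which are ignored here.

```python
import math


def center_of_mass(atoms):
    """Mass-weighted center of a list of (mass, (x, y, z)) entries."""
    total_mass = sum(mass for mass, _ in atoms)
    return tuple(
        sum(mass * coord[axis] for mass, coord in atoms) / total_mass
        for axis in range(3)
    )


def to_lattice(point, spacing=1.45):
    """Snap a Cartesian point (in Angstrom) to the nearest cubic lattice node.

    The node is returned as a tuple of integer lattice indices.
    """
    return tuple(round(c / spacing) for c in point)


def from_lattice(node, spacing=1.45):
    """Convert integer lattice indices back to Cartesian coordinates."""
    return tuple(i * spacing for i in node)


def projection_error(point, spacing=1.45):
    """Distance between a point and its lattice image.

    For nearest-node snapping this is bounded by spacing * sqrt(3) / 2.
    """
    return math.dist(point, from_lattice(to_lattice(point, spacing), spacing))


if __name__ == "__main__":
    # Two hypothetical side chain atoms of equal mass.
    side_chain = [(12.0, (0.0, 0.0, 0.0)), (12.0, (2.0, 0.0, 0.0))]
    com = center_of_mass(side_chain)
    print(com, to_lattice(com), projection_error(com))
```

The projection error is one simple measure of how much structural detail is lost in the reduction, which is the loss that a subsequent all-atom reconstruction and rescoring step is meant to compensate for.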
While the structure prediction example above describes a single pass of low-resolution sampling followed by the use of all-atom models for improved accuracy, multiscale modeling can also be done in a continuous fashion, for example through Monte Carlo type simulations that repeatedly move between low- and high-resolution models for extended sampling on an energy landscape that is closely coupled to the interactions of the high-resolution model. The implementation of multiscale modeling methods faces problems similar to those seen in the implementation of enhanced sampling methods, but usually involves the combination of multiple programs rather than a single simulation program. All-atom modeling of biological macromolecules is possible with a number of standard molecular modeling packages such as CHARMM or Amber, but these programs usually do not fully support low-resolution models and especially lattice-based representations. On the other hand, simulation programs for low-resolution models, such as the lattice simulation program MONSSTER [19], do not usually allow all-atom modeling. Both types of applications are fairly complex, so that the option of simply merging them is not very attractive. As for the implementation of enhanced sampling methods, a better solution would be to wrap simulation programs for all-atom and low-resolution models through a common interface layer and provide translation routines between both models as the basis for building multiscale applications.
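The wrapping strategy can be sketched as three small interfaces (a low-resolution sampler, a translator, and an all-atom scorer) combined into a single-pass protocol. The class and function names below are hypothetical and are not the MMTSB Tool Set API; the toy implementations stand in for the external programs and use plain numbers in place of molecular structures, purely so that the example runs on its own.

```python
import random
from abc import ABC, abstractmethod


class LowResSampler(ABC):
    """Wrapper interface for a low-resolution simulation program."""

    @abstractmethod
    def sample(self, n, rng):
        """Return a list of n low-resolution conformations."""


class Translator(ABC):
    """Translation routine from the low-resolution to the all-atom model."""

    @abstractmethod
    def to_all_atom(self, low_res):
        """Rebuild an all-atom conformation from a low-resolution one."""


class AllAtomScorer(ABC):
    """Wrapper interface for an all-atom energy evaluation."""

    @abstractmethod
    def energy(self, all_atom):
        """Return the all-atom energy of a conformation."""


def single_pass(sampler, translator, scorer, n, rng):
    """Sample at low resolution, rebuild, and return the best all-atom model."""
    candidates = sampler.sample(n, rng)
    rebuilt = [translator.to_all_atom(c) for c in candidates]
    return min(rebuilt, key=scorer.energy)


# Toy stand-ins: a "conformation" is a lattice index (int) at low
# resolution and a coordinate (float) at high resolution.
class ToyLatticeSampler(LowResSampler):
    def sample(self, n, rng):
        return [rng.randint(-5, 5) for _ in range(n)]


class ToyTranslator(Translator):
    def to_all_atom(self, low_res):
        return 0.5 * low_res


class ToyScorer(AllAtomScorer):
    def energy(self, all_atom):
        return (all_atom - 1.0) ** 2


if __name__ == "__main__":
    best = single_pass(ToyLatticeSampler(), ToyTranslator(), ToyScorer(),
                       n=20, rng=random.Random(0))
    print(best)
```

Because `single_pass` only depends on the three interfaces, a different simulation program could be substituted by writing another wrapper class, without touching the protocol itself.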
In this paper, we describe a new set of utilities and programming libraries for the implementation of computationally distributed enhanced and multiscale sampling methods based on existing simulation programs. This package, called Multiscale Modeling Tools for Structural Biology (MMTSB) Tool Set (available at http://mmtsb.scripps.edu/software/mmtsbToolSet.html), is an effort within the NIH Research Resource for Multiscale Modeling Tools for Structural Biology and follows the implementation strategy outlined above by integrating the existing programs through an interface layer while providing missing functionality as necessary. Interpreted scripting languages such as Perl or Python are particularly suitable for building interface layers since they combine ease of use and portability with a high level of functionality for addressing the complex system-oriented but computationally less intensive tasks [20]. Similar scripting-language-based designs have been used successfully in other related applications such as the molecular modeling tool kit (MMTK) [21] or the Bioperl toolkit [22].
The idea of the MMTSB Tool Set is not just to provide a set of user programs for certain enhanced sampling and multiscale modeling tasks, but also a programming workbench that provides the framework for the development of new applications that require the interplay of multiple simulation packages. It focuses on applications in the area of protein structure prediction, protein folding, and large-scale model building and refinement of proteins and nucleic acids for which enhanced and multiscale sampling techniques are particularly useful. As a subset of its functionalities, the tool set also provides a common user interface to all-atom modeling via CHARMM [1] or Amber [2] and reduced-model lattice modeling via MONSSTER [19]. Furthermore, the tool set incorporates a number of support functions that are motivated by multiscale modeling applications, but are certainly useful for other purposes as well. They include algorithms for translating quickly and accurately between low- and high-resolution models and methods for the organization, manipulation, and evaluation of large sets of conformations for a given protein, in what may be referred to as ensemble computing. Ensemble computing applications greatly benefit from parallel execution since they are inherently parallel in nature and typically require relatively little communication. The tool set provides basic parallel platform support implemented on the scripting language level, which makes it largely platform-independent and does not require specific communication libraries.
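The inherently parallel nature of ensemble computing can be illustrated with a short sketch in which each conformation of an ensemble is scored as an independent task and the only communication is the return of one number per structure. This is not how the tool set implements its parallel support; the function names, the tag-to-coordinates dictionary, and the toy score are assumptions for this example. A thread pool is used on the assumption that, in practice, each task would mostly wait for an external simulation program to finish; for the pure-Python toy score shown here, threads give no real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_score(coords):
    """Stand-in for an energy evaluation.

    Returns the sum of squared distances of all points from their centroid.
    """
    n = len(coords)
    centroid = [sum(p[axis] for p in coords) / n for axis in range(3)]
    return sum(
        sum((p[axis] - centroid[axis]) ** 2 for axis in range(3))
        for p in coords
    )


def evaluate_ensemble(ensemble, score, workers=4):
    """Score every conformation of an ensemble in parallel.

    ensemble maps an identifying tag to a list of (x, y, z) coordinates.
    Returns a dictionary mapping each tag to its score.
    """
    tags = list(ensemble)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so scores line up with tags.
        scores = list(pool.map(score, (ensemble[tag] for tag in tags)))
    return dict(zip(tags, scores))


if __name__ == "__main__":
    # Hypothetical two-member ensemble with made-up coordinates.
    ensemble = {
        "model.1": [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
        "model.2": [(0.0, 0.0, 0.0), (0.0, 4.0, 0.0)],
    }
    results = evaluate_ensemble(ensemble, toy_score, workers=2)
    for tag, value in sorted(results.items(), key=lambda item: item[1]):
        print(tag, value)
```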
In the following, we will first describe the architecture and components of the MMTSB Tool Set in more detail. We will then provide examples of how the tool set may be used for typical enhanced and multiscale sampling applications in protein structure prediction, structure evaluation, and structure refinement. We conclude by discussing how this architecture may be extended to new tasks and applications.
Architecture
Common modern scripting languages that would be appropriate for building complex applications are Perl and Python. We decided to use Perl as the (still) more widely used scripting language in order to minimize portability issues and to facilitate user extensions as much as possible. As depicted in Fig. 1, the architecture of the MMTSB Tool Set consists of a collection of object-oriented classes, called packages in Perl, that implement all of the core functionalities. These packages are used by
All-atom modeling
The central part of the all-atom modeling components revolves around interfaces to the molecular mechanics packages CHARMM [1] and Amber [2]. In this respect the MMTSB Tool Set may be viewed as an alternative user interface to CHARMM and Amber for certain standard modeling tasks. The tool set utilities are meant to provide access to these powerful programs without requiring the user to go through the learning curve of understanding the specific command and data input and output protocols of
Low-resolution modeling
Low-resolution modeling within the MMTSB Tool Set is based on the MONSSTER program [19]. MONSSTER implements the SICHO (SIde CHain Only) model where each amino acid in a polypeptide chain is represented by a single virtual particle located at the side chain center of mass and projected onto a cubic lattice with 1.45 Å grid spacing [28]. Such a model is particularly well suited for constant temperature or simulated annealing type Monte Carlo simulations based on an energy function that is
Translation between all-atom and low-resolution models
Both levels of detail, all-atom and low-resolution representations, are brought together by MMTSB functions that allow the generation of lattice chains from all-atom structures and the reconstruction of all-atom structures from lattice chains. Such mapping functions are essential for a multiscale modeling strategy and should preserve initial structures as much as possible through complete translation cycles. The utility for the generation of low-resolution models from all-atom structures is
Ensemble computing
Certain applications such as structure prediction, docking experiments, or estimates of conformational or interaction energies often involve relatively large ensembles of different conformations for a system of interest. Such ensembles may be assembled from simulation snapshots, the endpoints of simulated annealing runs as with the low-resolution lattice model described above, or by other means of conformational sampling. In many cases the ensemble structures are then evaluated and compared in
Replica exchange simulations
An exploration of the potential energy landscape for a system of interest, usually with the goal of finding low-lying regions, is the central theme of most molecular modeling applications. Sampling efficiency with standard simulation techniques such as molecular dynamics or Monte Carlo at a given temperature is governed by the distribution and height of energetic barriers, or ruggedness, and the slope towards the energy minimum in the landscape, both of which determine the kinetic behavior of
Advanced multiscale sampling methods
The utilities for lattice-based low-resolution sampling, for all-atom sampling, and for the translation between low-resolution and all-atom models can be combined to implement a basic multiscale modeling protocol. This is provided with the utility predict.pl, which integrates these steps into a single pass from low-resolution sampling to all-atom based scoring for structure prediction applications. More complex multiscale modeling protocols, however, may involve the continuous transition
Structure analysis functions
A number of utilities in the MMTSB Tool Set can be used for limited structure analysis tasks. They include functions such as clustering or the calculation of root mean square deviations and optimal superposition between two conformations, calculation of the radius of gyration, the fraction of native contacts, or standard peptide chain dihedral angles φ, ψ, ω, and χ1. In ensemble computing applications, most of these structural properties can be calculated in parallel for a whole set of ensemble
Applications
Having provided an overview of the different components of the MMTSB Tool Set, we now present a few typical applications that illustrate the use of the tool set—scoring of previously generated protein conformations with the ensemble computing facility, folding of peptides via replica exchange simulations, and the prediction of a missing fragment in the context of a known structure.
Generation of an ensemble data structure from input files
As the first step, an ensemble is generated from the set of predicted structures by using the checkin.pl utility:
The file names of the predictions submitted to CASP follow the format T0125*.pdb for this target and the structures are given an identifying tag casp in the newly created ensemble. Now that the predicted structures are available in ensemble format, ensemble computing tools can be used for further processing.
Preprocessing of input structures
Depending on how the input structures were generated, it is often a good idea to regularize and minimize the structures before calculating energy scores. Many structure predictions do not contain a complete set of atoms. Often, hydrogen atoms are missing and some predictions may consist only of Cα coordinates. Therefore, as the first step we will run the complete.pl utility in order to generate complete, all-atom structures for all of the predictions. Since we want to apply this command to all
Evaluation of scoring function
Finally, we can evaluate a scoring function for the minimized structures. The ensemble computing tool for energy evaluation, enseval.pl, is used as follows:
Here, we are using a scoring function that includes implicit solvation based on a generalized Born formalism [27], in this case the GBMV method [39], [40], as implemented in CHARMM as the default when GB is requested. Again, four CPUs are used in parallel to speed up the calculation. In this case the total energy of the entire molecular
Analysis of results
Once the enseval.pl run is complete the scores are available and can be queried with getprop.pl. A sorted list of all values is obtained easily with the following command:
In this command the property name score and structure tag min are used to identify the data set. Such a result may be sufficient for many applications, but often it is advantageous to form clusters of input structures based on mutual similarity and then compare average scores over cluster members to identify the lowest scoring
Generation of model conformations from lattice simulations
In the first step, conformations for the missing fragment are generated using lattice-based low-resolution sampling. As input for this step only a sequence file and the template structure are needed. The sequence file contains the entire sequence for the template as well as the missing part. It also provides secondary structure information that is trivially obtained for the template and can be predicted for the missing part with good reliability using a variety of different secondary structure
Selection of protein environment near region of interest
The sampled structures could now be minimized, scored, and analyzed as in the example above. With 200 residues, the complete protein is fairly large, and both the minimization and energy evaluation steps are relatively expensive. Since only a small part of the structure has been varied in the sampling protocol, one does not necessarily need to consider the entire structure. The part of the structure in the vicinity of the variable residues can be cut out according to a distance cutoff similar
Scoring of conformations
Following the example above, the sampled conformations are first minimized before being scored with an energy function that includes implicit solvation based on the generalized Born formalism [39], [40]:
In this case, we create a minimized structure under the tag cutmin. Options specifying restraints to keep the cutout region intact during the minimization as described above are read from an options file generated automatically by enscut.pl when the structures were reduced.
Clustering and analysis
At this point energy scores are available for the sampled conformations and we can proceed to cluster the sampled conformations based on mutual root mean square deviations with the command enscluster.pl:
Since we are primarily interested in the conformation of residues 48 to 55, we will cluster only based on these residues, disregarding the surrounding template. A quick view of the resulting clusters is available with the command showcluster.pl:
In this example, we find 7 clusters with sizes
Summary
We have introduced the MMTSB Tool Set, a collection of utilities and programming libraries aimed at enhanced sampling and multiscale modeling applications in structural biology. The tool set interfaces with the standard molecular modeling packages CHARMM and Amber for all-atom modeling and with MONSSTER for low-resolution lattice-based simulations. It adds a number of functions, such as the translation between all-atom and low-resolution representations, and implements replica exchange sampling
Acknowledgements
We thank the first users of the MMTSB Tool Set for testing and useful suggestions. Financial support from the NIH supported resource Multiscale Modeling Tools for Structural Biology (http://mmtsb.scripps.edu) (grant RR12255 to CLB III) is acknowledged. Support for the computational infrastructure and personnel (MF) that enabled these calculations came from the DOD through the grant DAMD17-03-2-0012 and is greatly appreciated.
References (56)
AMBER, a package of computer programs for applying molecular mechanics, normal mode analysis, molecular dynamics and free energy calculations to simulate the structural and energetic properties of molecules. Comput. Phys. Commun. (1995)
Generalized ensemble techniques and protein folding simulations. Comput. Phys. Commun. (2002)
New algorithms and the physics of proteins. Phys. A (2003)
Generalized ensemble simulations of spin systems and protein systems. Comput. Phys. Commun. (2002)
et al. Multiplexed-replica exchange molecular dynamics method for protein folding simulation. Biophys. J. (2003)
et al. Replica-exchange molecular dynamics method for protein folding. Chem. Phys. Lett. (1999)
et al. MONSSTER: a method for folding globular proteins with a small number of distance restraints. J. Mol. Biol. (1997)
et al. Implicit solvent models. Biophys. Chem. (1999)
et al. Backbone-dependent rotamer library for proteins: application to side-chain prediction. J. Mol. Biol. (1993)
et al. Extending the accuracy limits of prediction for side-chain conformations. J. Mol. Biol. (2001)