Bayesian shadows of molecular mechanisms cast in the light of evolution

https://doi.org/10.1016/j.tibs.2006.05.002

A great many carefully designed experiments will be required to fully understand biological mechanisms in atomic detail. A complementary approach is to use powerful statistical procedures to rapidly test numerous scientific hypotheses using vast numbers of protein sequences – the cell's own blueprints for specifying biological mechanisms. Bayesian inference of the evolutionary constraints imposed on functionally divergent proteins can reveal key components of the molecular machinery and thereby suggest likely mechanisms to test experimentally. This approach is demonstrated by considering how DNA polymerase clamp-loader AAA+ ATPases couple DNA recognition to ATP hydrolysis and clamp loading.

Introduction

A principal goal of modern biology is to understand biological mechanisms in atomic detail. Given the astonishing success of biochemistry and structural biology over the past 50 years, however, it is easy to forget how far we still are from achieving this goal. Indeed, despite remarkable progress in determining protein structures, many aspects of protein mechanisms – which lie at the heart of cellular processes – remain mysterious, even for well-characterized proteins.

Consider, for example, the replication factor C (RFC) clamp-loader, which, in the presence of ATP and a DNA clamp, forms a stable complex [1, 2, 3, 4, 5] that binds to RNA-primed DNA, resulting in ATP hydrolysis and loading of the clamp onto DNA [6] (Figure 1). The loaded clamp forms a homotrimeric ring that encircles DNA and facilitates processive DNA replication by binding to DNA polymerase and preventing it from falling off the DNA ([7, 8, 9, 10, 11]; reviewed in Refs [12, 13, 14]). Although the crystal structure of the yeast RFC–ATP–clamp complex [5] suggests how the clamp is properly oriented and how RNA-primed DNA is recognized, the functional implications of most of its detailed structural features are unclear. Which of these features are essential in coupling the recognition of RNA-primed DNA to the hydrolysis of ATP? Likewise, what structural principles couple ATP hydrolysis to loading of the clamp? In short, precisely how does the molecular machinery of the clamp-loader work?

One way to address these questions is to determine high-quality crystal structures of every functionally important conformational state of the RFC complex to see the dynamic changes involved in clamp loading. Moreover, determining the structural states of clamp-loaders from diverse organisms in this way would help to identify functionally crucial structural features, because typically these features are evolutionarily conserved, whereas coincidental ones are not. Of course, this approach is difficult and costly.

An alternative and complementary approach is to predict mechanisms from limited structural and biochemical data augmented by protein sequences, which are not only relatively abundant, but also reflect evolutionarily conserved structural properties associated with underlying mechanisms. Indeed, protein mechanisms have evolved through natural selection operating on sequences as the raw material; thus, in many respects sequence constraints specify those mechanisms.

Section snippets

Molecular mechanisms and protein sequences

Here, the mechanism of a protein is defined broadly to include all atomic properties that are essential to its biochemical function, which therefore include not only its ‘internal machinery’, but also important aspects of its structural fold (or ensemble of structural forms) and key sites of interaction with other cellular components. Both sequence conservation and sequence variation tell us something about protein mechanisms.

There are two sources of sequence conservation: conservation that is

Characterizing sequence constraints

Selective pressures are manifest as evolutionary constraints – in other words, as sequences that are conserved across homologous proteins from diverse organisms. Although we could characterize constraints by simply quantifying conserved positions in a multiple sequence alignment, to best predict underlying mechanisms we must also quantify constraints that distinguish a protein family of interest from various other groups of proteins from which it has functionally diverged (Box 1). Here, a

Bayesian inference

Although methods for detecting conserved sequences have existed for a long time, Bayesian inference can characterize sequence constraints in statistically rigorous and previously unexplored ways, leading to new insights. Bayesian approaches to scientific reasoning [22] have been used for many years in genetics [23], and recently in other areas of biology including gene expression [24] and signaling pathways [25].
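As a minimal illustration of how a Bayesian treatment attaches uncertainty to a sequence constraint, the sketch below computes a Beta posterior for the fraction of aligned sequences that carry a given residue at one alignment column. It is not the procedure used in this article; the function name `conservation_posterior`, the uniform Beta(1, 1) prior and the counts are assumptions chosen only for illustration.

```python
from math import sqrt


def conservation_posterior(n_conserved, n_total, prior_a=1.0, prior_b=1.0):
    """Return (a, b, mean, sd) of the Beta posterior for the fraction of
    sequences that carry the residue of interest at one alignment column.

    Assumes a Beta(prior_a, prior_b) prior and treats sequences as
    independent draws (a simplification; real homologs are correlated).
    """
    if not 0 <= n_conserved <= n_total:
        raise ValueError("n_conserved must lie between 0 and n_total")
    a = prior_a + n_conserved
    b = prior_b + (n_total - n_conserved)
    mean = a / (a + b)
    sd = sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return a, b, mean, sd


# Hypothetical counts: the same 90% observed conservation, small vs. large alignment.
small = conservation_posterior(9, 10)
large = conservation_posterior(900, 1000)
print("posterior sd, 10 sequences:  ", small[3])
print("posterior sd, 1000 sequences:", large[3])
```

The observed conservation is the same in both cases, but the posterior standard deviation is much smaller for the larger alignment, which is the sense in which abundant sequence data support more detailed inferences.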

Bayesian approaches have three advantages: (i) when given the number of empirical

Quantifying experimental uncertainty

Statistical models based on protein sequence data, although not as detailed as structurally based models, do contain considerable detail. To be certain of their validity, highly detailed models must be based on large amounts of experimental data. Bayesian inference can make the most of the available data by quantifying the degree of uncertainty and thereby ‘letting the data speak for itself’ (Box 2). This quantification can tell us whether certain inferences are reasonable or not. Without such

Applying the scientific method to non-hypothesis-driven data

The recent trend towards high-throughput, non-hypothesis-driven experimentation, followed by computational analysis of the data, raises concerns regarding scientific standards. Do analyses that are based on ad hoc heuristic procedures accurately reflect biological realities or are they significantly biased by idiosyncrasies or implicit assumptions that have been inadvertently designed into those procedures?

Bayesian inference provides a way to ensure scientific rigor by directly implementing the

Modeling complex, highly correlated properties

Because complex scientific models require large amounts of experimental data for validation, researchers typically avoid such models by focusing on one component of a complex system at a time. This strategy, which assumes that the function of the whole is roughly the sum of the functions of its parts, fails, however, when the functions of the parts are highly correlated. Highly correlated functions seem likely for individual residues in proteins and, for that matter, for the components of many

Inferring aspects of clamp-loader mechanisms

Subunits of the RFC clamp-loader complex belong to the AAA+ class of P loop ATPases [27], which includes dozens of hierarchically arranged, functionally divergent subgroups. For example, in archaea the RFC complex consists of one large subunit (RFCL) and four copies of a small subunit (RFCS) [28], whereas in eukaryotes the small subunit has diverged into four distinct subunits with specialized functions. The eukaryotic large subunit is denoted RFC-A, whereas the four small subunits are denoted

Hypothetical mechanism for coupling DNA sensing to ATP hydrolysis

The sequence alignment in Figure 2a highlights conserved residues that are both characteristic of active RFC ATPases (subunits RFC-A to RFC-D) and uncharacteristic of (presumed) catalytically impaired RFC-E subunits. Thus, these residues are likely to have important functions that are closely associated with ATP hydrolysis and, as a detailed analysis suggests, seem to couple ATP hydrolysis to sensing of DNA.
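The idea of a residue being characteristic of one group and uncharacteristic of another can be sketched as a Bayesian model comparison at a single alignment column. The example below is a simplified stand-in, not the statistical model used to produce Figure 2a: it compares a model in which the foreground group (for example, active ATPase subunits) and the background group (for example, RFC-E subunits) have separate residue frequencies against a model with one shared frequency. The function names, the uniform Beta(1, 1) priors and all counts are hypothetical.

```python
from math import lgamma


def log_beta(a, b):
    """Natural log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_marginal(k, n, a=1.0, b=1.0):
    """Log marginal likelihood of k residue matches in n sequences under a
    Beta(a, b) prior on the match frequency (binomial coefficient omitted,
    since it cancels in the Bayes factor below)."""
    return log_beta(a + k, b + n - k) - log_beta(a, b)


def log_bayes_factor(k_fg, n_fg, k_bg, n_bg, a=1.0, b=1.0):
    """Log Bayes factor for 'foreground and background have separate
    frequencies' versus 'both groups share one frequency'.

    Positive values favor a group-specific constraint at this column.
    """
    separate = log_marginal(k_fg, n_fg, a, b) + log_marginal(k_bg, n_bg, a, b)
    shared = log_marginal(k_fg + k_bg, n_fg + n_bg, a, b)
    return separate - shared


# Hypothetical column 1: residue present in all 40 foreground sequences
# and in none of 40 background sequences.
divergent = log_bayes_factor(40, 40, 0, 40)
# Hypothetical column 2: residue present in half of each group.
uniform = log_bayes_factor(20, 40, 20, 40)
print("divergent column favors separate frequencies:", divergent > 0)
print("uniform column favors a shared frequency:   ", uniform < 0)
```

A strongly positive log Bayes factor flags a column where the two groups differ, whereas a column with similar frequencies in both groups scores below zero because the simpler shared-frequency model explains the data equally well with fewer parameters.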

In the RFC–ATP–clamp complex [5], one of these residues, an arginine (e.g. Arg84 in

A trigger for initiation of ATP hydrolysis

The large subunit, RFC-A, has been proposed to recognize primed DNA as a signal for DNA-dependent ATP hydrolysis [33] and, if it does, is likely to be the first subunit to undergo ATP hydrolysis and to initiate ATP hydrolysis by adjacent subunits. A DRGG motif that is uniquely conserved in RFC-A (Figure 2d) provides potential clues to this process. The arginine of this motif (Arg434 in RFC-A; Figure 3c) might function as a DNA-sensing trigger because it strikingly protrudes into the center of

Propagation of ATP hydrolysis

Constraints associated with the propagation of ATP hydrolysis to adjacent subunits are likely to be imposed on ATPases that directly interact with an adjacent ATP site – namely, subunits RFC-B to RFC-D (Figure 2e). The most characteristic residue in this category is an arginine in the α4 helix (e.g. Arg90 in RFC-B and Arg94 in RFC-C; Figure 3c). This arginine electrostatically interacts with main-chain oxygen atoms in the NxSD motif of the adjacent subunit (Figure 3c) and with conserved acidic

Clamp loading

Given that the clamp directly binds to the C-terminal end of helix α4 [5, 35], the repositioning of this helix upon ATP hydrolysis might be coupled to release of the clamp onto DNA. Notably, a lysine residue (e.g. Lys109 in RFC-B; Figure 3c) that electrostatically interacts with the C-terminal end of helix α4 is subject to, by far, the strongest constraint characteristic of all clamp-loader ATPases predicted to contact an adjacent ATP site (Figure 2f).

Moreover, in the crystal structure of the

Concluding remarks

Protein sequences encode the information that the cell itself uses in specifying biological mechanisms, but so far they have been underused for predicting aspects of those mechanisms, owing – in large part – to the inherent complexity of this information. Bayesian procedures can exploit this complexity by following empirical leads to statistically sound conclusions that are too subtle to be picked up by other means. Although probabilistic in nature, such findings can point to feasible

Acknowledgements

I thank Jun S. Liu, Yuri Lazebnik and Senthil K. Muthuswamy for critically reading the manuscript and helpful comments. This work was supported by a grant from the National Institutes of Health, National Library of Medicine (LM06747).

References (37)

  • T. Oyama. Atomic structure of the clamp loader small subunit from Pyrococcus furiosus. Mol. Cell (2001)
  • D. Jeruzalmi. Mechanism of processivity clamp opening by the δ subunit wrench of the clamp loader complex of E. coli DNA polymerase III. Cell (2001)
  • R.A. Sayle et al. RASMOL: biomolecular graphics for all. Trends Biochem. Sci. (1995)
  • T. Tsurimoto et al. Purification of a cellular replication factor, RF-C, that is required for coordinated synthesis of leading and lagging strands during simian virus 40 DNA replication in vitro. Mol. Cell. Biol. (1989)
  • T. Tsurimoto et al. Functions of replication factor C and proliferating-cell nuclear antigen: functional similarity of DNA polymerase accessory proteins from human cells and bacteriophage T4. Proc. Natl. Acad. Sci. U. S. A. (1990)
  • G.D. Bowman. Structural analysis of a eukaryotic sliding DNA clamp–clamp loader complex. Nature (2004)
  • G. Prelich. The cell-cycle regulated proliferating cell nuclear antigen is required for SV40 DNA replication in vitro. Nature (1987)
  • Z. Kelman et al. DNA polymerase III holoenzyme: structure and function of a chromosomal replicating machine. Annu. Rev. Biochem. (1995)