Statistical Analysis of Interface Similarity in Crystals of Homologous Proteins

https://doi.org/10.1016/j.jmb.2008.06.002

Abstract

Many proteins function as homo-oligomers and are regulated via their oligomeric state. For some proteins, the stoichiometry of homo-oligomeric states under various conditions has been studied using gel filtration or analytical ultracentrifugation experiments. The interfaces involved in these assemblies may be identified using cross-linking and mass spectrometry, solution-state NMR, and other experiments. However, for most proteins, the actual interfaces that are involved in oligomerization are inferred from X-ray crystallographic structures using assumptions about interface surface areas and physical properties. Examination of interfaces across different Protein Data Bank (PDB) entries in a protein family reveals several important features. First, similarities in space group, asymmetric unit size, and cell dimensions and angles (within 1%) do not guarantee that two crystals are actually the same crystal form, containing similar relative orientations and interactions within the crystal. Conversely, two crystals in different space groups may be quite similar in terms of all the interfaces within each crystal. Second, NMR structures and an existing benchmark of PDB crystallographic entries consisting of 126 dimers as well as larger structures and 132 monomers were used to determine whether the existence or lack of common interfaces across multiple crystal forms can be used to predict whether a protein is an oligomer or not. Monomeric proteins tend to have common interfaces across only a minority of crystal forms, whereas higher-order structures exhibit common interfaces across a majority of available crystal forms. The data can be used to estimate the probability that an interface is biological if two or more crystal forms are available. 
Finally, the Protein Interfaces, Surfaces, and Assemblies (PISA) database available from the European Bioinformatics Institute is more consistent in identifying interfaces observed in many crystal forms compared with the PDB and the European Bioinformatics Institute's Protein Quaternary Server (PQS). The PDB, in particular, is missing highly likely biological interfaces in its biological unit files for about 10% of PDB entries.

Introduction

Many proteins are oligomeric due to the association of identical subunits under physiological conditions. Homo-oligomerization may be part of allosteric regulation1 or contribute to conformational and thermal stabilities2 and to higher binding affinity with other molecules. Homodimeric proteins have been found to form interactions with a larger number of other proteins compared with monomeric proteins.3 Multimerization is particularly common in enzymes, transcription factors, and signal transduction.4 The major driving forces for protein multimerization are shape and charge complementarity between the associating subunits brought about by a combination of hydrophobic and polar interactions.5, 6 Some proteins oligomerize by domain swapping, in which a segment of monomeric protein is replaced by an identical segment from another subunit and vice versa.7, 8 Many proteins have different predominant oligomeric states under different physiologically relevant conditions, and these states may have important functional differences. Homodimerization may arise in evolution because of stronger tendencies of identical interfaces to self-associate compared with dissimilar interfaces. Heterodimers of proteins in the same superfamily may then evolve from such homodimers.9

Some human diseases are caused by inherited missense mutations that act in part by altering oligomeric association. For instance, infantile cortical hyperostosis (Caffey disease) is a genetic disorder caused by a missense mutation in exon 41 of the gene encoding the α1(I) chain of type I collagen, producing abnormal disulfide-bonded dimeric α1(I) chains.10 Myofibrillar myopathy is a human disease of muscle weakening; one causative mutation is localized in the dimerization domain of the filamin C gene and disrupts its secondary structure, leading to an inability to dimerize properly.11 Cu/Zn superoxide dismutase (SOD) is an efficient enzyme that catalyzes the conversion of superoxide to oxygen and hydrogen peroxide.12 Familial amyotrophic lateral sclerosis or Lou Gehrig's disease is associated with mutations in Cu/Zn SOD.13, 14 Some mutations destabilize the SOD dimer, causing abnormal aggregation that may be lethal to cells.

Experimental means for determining the size of an oligomer include analytical ultracentrifugation15 and gel filtration.16 These methods separate proteins and protein complexes based on their size or mass, from which the oligomerization state may be inferred. However, knowing the size of a protein oligomer does not provide information on the interacting surfaces within an oligomer or the overall structure. Combining separation of oligomers with cross-linking and mass spectrometry can be used to determine protein segments that may be in the binding interfaces between monomers.17 Fluorescence resonance energy transfer experiments can be used to identify donor–acceptor pairs of residues that must be near each other in a protein complex to identify which of several dimers in an X-ray crystal structure is likely to be physiologically relevant.18 NMR can also be used to determine detailed information on the structure and dynamics of protein oligomers in solution. However, the size of proteins that can be studied easily by NMR is limited.

For most proteins, data on oligomeric association size and, in particular, structure come from X-ray crystallography. For many proteins, the size and actual structure of multimers are controversial or unknown and are based only on what is observed in crystal structures, sometimes even a single crystal structure. Both the Protein Data Bank (PDB)19 and the European Bioinformatics Institute20, 21 provide information on “biological units” or assemblies that are the assumed biologically relevant oligomeric structures found within crystals. The PDB's biological units are based on what authors of structures themselves believe to be the biologically relevant structure, whereas those of the recently developed Protein Interfaces, Surfaces, and Assemblies (PISA) server21 from the European Bioinformatics Institute are based on the analysis of interfaces and predicted stability of complexes observed in single crystal structures. The Protein Quaternary Server (PQS)20 contains both manual and automated identifications of biological units (E. Krissinel, personal communication). The PDB and PQS usually have one biological unit size for each PDB entry, whereas the PISA server contains multiple oligomeric structures of different sizes for many PDB entries based on chemical thermodynamic calculations on complex stability. The recently developed Protein Quaternary Structure Investigation (PiQSi) database provides manually annotated sizes of biological units from the literature for PDB entries.22

Many databases and analyses have used PDB and PQS biological units to examine the interfaces between protein domains. For instance, PIBASE23 provides a list of structures for a query of two Structural Classification of Proteins (SCOP) superfamily or family designations24 and provides access to coordinates for each pairwise interaction. Interactions in the PIBASE are derived from two sources—the author-approved files provided by the PDB (e.g., pdb1ylv.ent), which generally contain the asymmetric unit of the crystal structure and many nonphysiological interactions, and the hypothetical biological units as proposed by the authors of PQS (e.g., 1ylv.mmol). The emphasis is on characterizing pairwise interfaces in terms of surface area and polar/nonpolar content. PSIMAP (Protein Structural Interactome map)/PSIBASE (database for PSIMAP)25 also performs binary searches for two SCOP-defined domains and finds all structures containing interactions between the query domains. Other databases, such as SNAPPI-DB,26 SCOPPI,27 and iPfam,28 also use the SCOP, PFAM, PDB, and PQS to define atomic interactions among protein domains. Databases of this sort are used for statistical analyses of residue contacts across interfaces to develop methods for predicting or scoring interfaces.5, 29, 30, 31, 32, 33 However, if the data in PDB and PQS are incorrect, these analyses are called into question, both in training data and testing data. Homology modeling based on known multimer structures also depends on accurate multimer structures, and incorrect biological inferences can be made when the assumed quaternary structure of the template is incorrect.

We recently compared the biological units in the PDB and PQS for all crystallographic entries in the PDB and found that they agree on only 83% of entries.34 The PDB has a higher tendency than PQS to have biological units that are identical with the asymmetric unit of the same structure, indicating perhaps that many authors may make the unwarranted assumption that the asymmetric and biological units are the same. We also found that the PDB and PQS have inconsistent assignments of biological units for proteins in multiple entries in the PDB that all have the same crystal form. This occurs in the PDB for 12% of entries and in the PQS for about 18% of entries. The PDB's assignments may be more consistent merely because a single research group may solve multiple structures within the same crystal form and assign similar biological units to all of them. When the PDB and PQS agree on the size of a multimer for a single PDB entry, they disagree on the orientation and interface between interacting monomers in less than 2% of cases. The PDB and PQS may have different interfaces across a family of closely related or identical proteins.34
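The consistency check described above can be illustrated with a minimal sketch. It assumes that each entry has already been assigned a crystal-form identifier and a biological unit size; the function name, the tuple layout, and the example identifiers are hypothetical and are not taken from the original analysis.

```python
from collections import defaultdict


def inconsistent_forms(entries):
    """Return crystal forms whose entries disagree on biological unit size.

    entries: iterable of (entry_id, crystal_form_id, biological_unit_size).
    The crystal-form assignment is assumed to be given (hypothetical input).
    """
    sizes = defaultdict(set)
    for _entry_id, form_id, unit_size in entries:
        sizes[form_id].add(unit_size)
    # A form is inconsistent when more than one unit size was annotated for it.
    return {form: sorted(s) for form, s in sizes.items() if len(s) > 1}


# Hypothetical annotations: three entries in form "CF1", two in form "CF2".
example_entries = [
    ("entry1", "CF1", 2),
    ("entry2", "CF1", 2),
    ("entry3", "CF1", 1),
    ("entry4", "CF2", 4),
    ("entry5", "CF2", 4),
]
flagged = inconsistent_forms(example_entries)
```

In this toy input only "CF1" is flagged, because its entries are annotated both as monomers and as dimers.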

A number of studies have attempted to differentiate between biological and crystallization-induced contacts. Ponstingl et al. compiled a set of 96 monomers and 76 homodimers in the PDB by reference to the published literature35 and compared the ability of buried surface area and pair interaction scores to predict biological contacts in crystals. This data set has subsequently been used by others as a benchmark for methods that attempt to determine biological assemblies from single crystals.21, 22 Bahadur et al.36, 37 assembled a set of interfaces consisting of 70 heterodimeric structures, 122 homodimeric structures, and 188 crystal packing interfaces with surface areas greater than 800 Å² and examined the physical properties of the different interface classes. Shoemaker et al.38 looked for common interfaces in different crystals of identical and homologous proteins, so-called conserved binding modes, in order to identify likely biologically relevant structures.

In this study, we thoroughly examined the interfaces in crystals of single homologous proteins. We attempted to answer several questions. First, when are two crystals of the same or similar proteins really of the same crystal form and when are they not? Surprisingly, we found that PDB entries with the same space group, asymmetric unit size, and quite similar unit cell dimensions are occasionally different crystal forms as judged by the interfaces and monomer–monomer orientations that exist within the crystal lattice. Conversely, two crystals in different space groups may be quite similar in terms of all or nearly all the interfaces within each crystal. This occurs (1) when one contains a subset of symmetry operators of the other and a larger asymmetric unit and (2) when one is a small distortion of the other such that the space group is different. This analysis helps sort PDB entries within a family into truly different crystal forms.

Second, we examined the hypothesis used by many crystallographers to infer biological interactions: observation of the same interface in different crystal forms of a protein (or members of the same family) suggests that the interface may be biologically relevant. We compared all interfaces in the available crystal forms in each family and determined those shared by two or more crystal forms. We determined the number of crystal forms with the interface, M, compared with the total number of different crystal forms in the same family, N. We then evaluated the usefulness of these numbers with prior benchmarks on oligomeric interactions and with NMR structures. When M is greater than 4 or 5, and especially when M is close to or equal to N, the observed interfaces are likely to be part of biologically relevant assemblies. We found 36 families in which all N out of N crystal forms contain a particular interface in which N ≥ 10. These interfaces are very likely to be physiological. We also found that monomers in a benchmark set composed of both the Ponstingl et al. and Bahadur et al. sets tend to have M ≪ N.
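The bookkeeping behind M and N can be sketched as follows. This is an illustration only: it assumes that equivalent interfaces in different crystal forms have already been grouped under a shared label (that comparison step is not shown here), and the function name and the example family are hypothetical.

```python
from collections import defaultdict


def count_shared_interfaces(forms):
    """Count, for each interface label, how many crystal forms contain it.

    forms: dict mapping a crystal-form id to the set of interface labels
    observed in that form (labels are assumed to be pre-assigned).
    Returns a dict mapping each label to (M, N), where M is the number of
    crystal forms containing the interface and N is the total number of forms.
    """
    n_forms = len(forms)
    counts = defaultdict(int)
    for interfaces in forms.values():
        for label in set(interfaces):  # count each form at most once per label
            counts[label] += 1
    return {label: (m, n_forms) for label, m in counts.items()}


# Hypothetical family with four crystal forms and three distinct interfaces.
example_family = {
    "CF1": {"A", "B"},
    "CF2": {"A", "C"},
    "CF3": {"A"},
    "CF4": {"A", "B"},
}
shared = count_shared_interfaces(example_family)
```

In this toy family, interface "A" has M = N = 4, whereas "C" occurs in a single form and would not count as shared.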

Third, we examined the usefulness of evolutionary information in evaluating interfaces appearing in more than one crystal form. Different crystal forms of identical proteins often contain common interfaces, but these usually appear in only two or three such forms and are not shared by homologous proteins. That is, they are formed only under nonphysiological crystallization conditions, including high protein concentration, peculiar pH, and the presence of nonphysiological ligands. This has previously been observed for T4 lysozyme, which has been studied in many crystal forms.39 When an interface is shared in two crystal forms by divergent proteins, the interface is very likely to be biologically important. We also found that some interfaces in large families are restricted to one branch of a family, indicating the evolution of an interface in one branch of the family and/or loss in another. This highlights the importance of solving structures of related proteins.

Finally, we compared interfaces common to multiple crystal forms with the annotations found in the PDB, PQS, and PISA server. With an increasing number of crystal forms that contain a given interface, it becomes increasingly likely that the available annotations agree that such an interface is part of a biologically relevant assembly. The PISA server is found to be the most reliable in identifying interfaces for which the evidence, in terms of number of crystal forms containing the interface, seems very high. The PISA server is therefore the best source of biological assembly information when only one or two crystal forms are currently available.

This study is closest to the work of Shoemaker et al.,38 albeit with some important differences. First, we examined the interfaces across PDB entries of homologous proteins to determine whether they are the same crystal form, despite similarities and differences in space group, asymmetric unit size, and unit cell dimensions and angles. Shoemaker et al. separated crystal forms only by space group and/or differences in cell dimensions greater than 2%. We found that these are inadequate to classify crystals as similar or different. Second, we evaluated the usefulness of the number of different crystal forms and the evolutionary relationships of shared interfaces, neither of which was considered by Shoemaker et al. Finally, we provide in the Supplementary Data coordinate files of the shared interfaces that may be useful for further research as training or testing data.
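For contrast, a criterion based only on space group and cell parameters might look like the sketch below. It is not the procedure of Shoemaker et al. or of this study; the class, the function name, and the definition of relative difference are assumptions made for illustration, with the tolerance set to the 2% figure quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Crystal:
    space_group: str
    cell: tuple  # (a, b, c, alpha, beta, gamma); lengths in Å, angles in degrees


def same_form_by_cell(x, y, tol=0.02):
    """Naive test: same space group and every cell parameter within tol.

    The relative difference is taken against the larger of the two values,
    which is an assumption of this sketch.
    """
    if x.space_group != y.space_group:
        return False
    return all(
        abs(p - q) <= tol * max(abs(p), abs(q))
        for p, q in zip(x.cell, y.cell)
    )


# Hypothetical crystals used only to exercise the function.
c1 = Crystal("P212121", (50.0, 60.0, 70.0, 90.0, 90.0, 90.0))
c2 = Crystal("P212121", (50.5, 60.2, 70.9, 90.0, 90.0, 90.0))
c3 = Crystal("P21", (50.0, 60.0, 70.0, 90.0, 90.0, 90.0))
c4 = Crystal("P212121", (50.0, 60.0, 75.0, 90.0, 90.0, 90.0))
```

A test of this kind never inspects the interfaces themselves, so it cannot detect the two situations described earlier: matching cells with different packing, or different space groups with nearly identical interfaces.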

Results

In this study, we focused on homo-oligomeric structures and therefore considered only PDB entries with a single polypeptide sequence and no nucleic acid present in the crystal. We used SCOP 1.73 to divide 16,164 entries in the PDB containing one protein sequence into families. Because this version of SCOP covers less than 70% of the current PDB, we used PSI-BLAST to assign additional single-sequence entries to SCOP families (see Materials and Methods), for a total of 19,842 entries.

Discussion

The structures of oligomeric assemblies of proteins are important for understanding their functions and regulation and the phenotypes of mutations. At the present time, there exist few repositories of experimental data on the oligomeric state of proteins in solution, such as gel filtration, analytical ultracentrifugation, and other experiments, to determine the size of protein assemblies under even approximately physiological conditions. Even these experiments may not cover the range of

Data sources

The data used in this study come from five sources: protein structure files from the PDB in XML format (PDBML)19, 63; biological unit coordinate files from PQS20 in the legacy PDB format; PISA server multimers in XML format21; domain classification files from SCOP 1.7324; and CE/PSI-BLAST hit files from a nonredundant (100%) PDB database in our laboratory.64, 65 We built a crystal form and PDB biological unit from the asymmetric unit information given in the PDB XML files. We used the most

Acknowledgements

This work was supported by the National Institutes of Health through grant R01 GM73784 awarded to R.L.D. We thank Dr. Longin Jan Latecki for useful comments. We also thank Zukang Feng and Eugene Krissinel for providing information required to build PDB and PISA server biological units, respectively. Rajib Mitra and Brian Weitzner assisted by visually examining many PDB structures to verify the ability of the parameter Q to identify similarities and differences in interfaces and crystal forms.

References (68)

  • E.D. Levy. PiQSi: protein quaternary structure investigation. Structure (2007)
  • A.G. Murzin et al. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. (1995)
  • E.R. Jefferson et al. Biological units and their effect upon the properties and prediction of protein–protein interactions. J. Mol. Biol. (2006)
  • D.M. Hoover et al. The structure of human beta-defensin-1: new insights into structural properties of beta-defensins. J. Biol. Chem. (2001)
  • M. Pazgier et al. Studies of the biological properties of human beta-defensin 1. J. Biol. Chem. (2007)
  • P. Aloy et al. The relationship between sequence and interaction divergence in proteins. J. Mol. Biol. (2003)
  • M. Moche et al. Azide and acetate complexes plus two iron-depleted crystal structures of the di-iron enzyme delta9 stearoyl-acyl carrier protein desaturase. Implications for oxygen activation and catalytic intermediates. J. Biol. Chem. (2003)
  • D.J. Katzmann et al. Ubiquitin-dependent sorting into the multivesicular body pathway requires the function of a conserved endosomal protein sorting complex, ESCRT-I. Cell (2001)
  • S. Bäckström et al. The RUNX1 Runt domain at 1.25 Å resolution: a structural switch and specifically bound chloride ions modulate DNA binding. J. Mol. Biol. (2002)
  • W.S. Valdar et al. Conservation helps to identify biologically relevant crystal contacts. J. Mol. Biol. (2001)
  • T.W. Traut. Dissociation of enzyme oligomers: a mechanism for allosteric regulation. Crit. Rev. Biochem. Mol. Biol. (1994)
  • I. Ispolatov et al. Binding properties and evolution of homodimers in protein–protein interaction networks. Nucleic Acids Res. (2005)
  • Y. Yarden et al. Epidermal growth factor induces rapid, reversible aggregation of the purified epidermal growth factor receptor. Biochemistry (1987)
  • S. Jones et al. Principles of protein–protein interactions. Proc. Natl Acad. Sci. USA (1996)
  • M.J. Bennett et al. 3D domain swapping: a mechanism for oligomer assembly. Protein Sci. (1995)
  • R.C. Gensure et al. A novel COL1A1 mutation in infantile cortical hyperostosis (Caffey disease) expands the spectrum of collagen-related disorders. J. Clin. Invest. (2005)
  • I. Fridovich. Superoxide dismutases. Adv. Enzymol. Relat. Areas Mol. Biol. (1986)
  • D.R. Borchelt et al. Superoxide dismutase 1 with mutations linked to familial amyotrophic lateral sclerosis possesses significant activity. Proc. Natl Acad. Sci. USA (1994)
  • Y. Furukawa et al. Posttranslational modifications in Cu,Zn-superoxide dismutase and mutations associated with amyotrophic lateral sclerosis. Antioxid. Redox Signal. (2006)
  • T.R. Dafforn. So how do you know you have a macromolecular complex? Acta Crystallogr., Sect. D: Biol. Crystallogr. (2007)
  • H.M. Berman et al. The Protein Data Bank. Nucleic Acids Res. (2000)
  • F.P. Davis et al. PIBASE: a comprehensive database of structurally defined protein interfaces. Bioinformatics (2005)
  • S. Gong et al. PSIbase: a database of Protein Structural Interactome map (PSIMAP). Bioinformatics (2005)
  • E.R. Jefferson et al. SNAPPI-DB: a database and API of Structures, iNterfaces and Alignments for Protein–Protein Interactions. Nucleic Acids Res. (2007)