Discrimination between Distant Homologs and Structural Analogs: Lessons from Manually Constructed, Reliable Data Sets

https://doi.org/10.1016/j.jmb.2007.12.076

Abstract

A natural way to study protein sequence, structure, and function is to put them in the context of evolution. Homologs inherit similarities from their common ancestor, while analogs converge to similar structures due to a limited number of energetically favorable ways to pack secondary structural elements. Using novel strategies, we previously assembled two reliable databases of homologs and analogs. In this study, we compare these two data sets and develop a support vector machine (SVM)-based classifier to discriminate between homologs and analogs. The classifier uses a number of well-known similarity scores. We observe that although both structure scores and sequence scores contribute to SVM performance, profile sequence scores computed based on structural alignments are the best discriminators between remote homologs and structural analogs. We apply our classifier to a representative set from the expert-constructed database, Structural Classification of Proteins (SCOP). The SVM classifier recovers 76% of the remote homologs defined as domains in the same SCOP superfamily but from different families. More importantly, we also detect and discuss interesting homologous relationships between SCOP domains from different superfamilies, folds, and even classes.

Introduction

Three-dimensional structural similarities among proteins are explained by either divergence or convergence. In divergent evolution, homologs inherit similar structures from their common ancestor. In convergent evolution, proteins from distinct evolutionary lineages arrive independently at similar structures due to a limited number of energetically favorable ways to pack secondary structural elements (SSEs) [1, 2, 3], and such proteins are called analogs. Judging whether two structurally similar proteins are homologous or analogous remains a difficult task. Statistically significant sequence similarity, as detected by sequence search tools such as PSI-BLAST [4], is generally accepted as adequate evidence for homology [5, 6]. In the absence of significant sequence similarity, remote homology inference can be based on overall structural similarity, augmented by other properties such as similar arrangements of functional residues, common ligand-binding modes, shared unusual structural features, and similar domain organizations [7, 8]. However, capturing these features often requires visual inspection by human experts and is more in the realm of art than science.

The Structural Classification of Proteins (SCOP) database [9] represents a comprehensive collection of manually curated homologous superfamilies of protein domains with known structures. In the SCOP hierarchy, domains with significant sequence similarity or overwhelming structural and functional similarity (close homologs) are grouped into the same family; families with convincing structural and/or functional evidence for common ancestry are grouped into the same superfamily; superfamilies with the same overall three-dimensional structure and topology but without very strong evidence for homology are grouped into the same fold; and folds are grouped into classes based on their SSE compositions. SCOP is manually maintained by human experts, and its superfamily level is regarded as the most reliable standard for remote homologs [10].
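The four-level hierarchy described above can be illustrated with a minimal Python sketch. It assumes each domain is labeled with a dot-separated SCOP classification string of the form class.fold.superfamily.family (for example "b.42.1.1"); the function names and the sample strings are hypothetical and chosen only for illustration. Under that assumption, a pair counts as remote homologs in the sense used in this study when the two domains share a superfamily but belong to different families.

```python
from typing import Optional

# Level names in the order they appear in a classification string
# such as "b.42.1.1" (class.fold.superfamily.family).
LEVELS = ("class", "fold", "superfamily", "family")


def deepest_shared_level(sccs_a: str, sccs_b: str) -> Optional[str]:
    """Return the deepest SCOP level two domains share, or None if even the class differs."""
    shared = None
    for level, part_a, part_b in zip(LEVELS, sccs_a.split("."), sccs_b.split(".")):
        if part_a != part_b:
            break  # the hierarchy is nested, so stop at the first mismatch
        shared = level
    return shared


def is_remote_homolog_pair(sccs_a: str, sccs_b: str) -> bool:
    """True when two domains share a superfamily but sit in different families."""
    return deepest_shared_level(sccs_a, sccs_b) == "superfamily"


# Illustrative labels only; they are not taken from the data sets in this study.
print(deepest_shared_level("b.42.1.1", "b.42.2.1"))
print(is_remote_homolog_pair("b.42.1.1", "b.42.1.2"))
```

Pairs whose deepest shared level is only the fold, the class, or nothing at all are the ones for which SCOP itself makes no homology claim.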

Several efforts have been made to discern the boundary between homology and analogy in an automated and quantitative way. Russell et al. [11] statistically analyzed structurally aligned homologous and analogous pairs and found that homologs generally retain higher sequence identity and more conserved SSEs and solvent accessibility than analogs. Matsuo and Bryant [12] defined a homologous core structure representing the consensus substructure in a protein family, and used the overlap of homologous core structures to distinguish homologs from analogs. Dietmann and Holm [13] trained a neural network to discriminate between homologs and analogs based on sequence, structure, and functional similarities. All three studies used domains in the same SCOP superfamily as homologs and domains in different SCOP superfamilies as analogs in their analysis. Given the conservative nature of the SCOP hierarchy, a potential flaw of this approach is the contamination of the analog data set by homologs. Domains in different superfamilies are not necessarily analogs and may in fact be homologous when new evidence emerges [9]. For instance, through careful analysis, Ponting and Russell [14] suggested that at least five SCOP superfamilies under the β-trefoil fold were actually homologous and had descended from a common ancestor.

To avoid the aforementioned ambiguity, we approach the problem of discriminating between homologs and analogs with more clear-cut and reliable data sets. Previously, we manually constructed a homolog database (MALIDUP [15]) composed of duplicated domain pairs and an analog database (MALISAM [16]) composed of three categories of analogous pairs (a hybrid motif and a core motif, an interface motif and a core motif, and an artificial protein and a natural protein). Each pair in MALIDUP or MALISAM is carefully inspected to convincingly support homology or analogy and then manually superimposed and aligned to ensure good alignment quality. In this study, we use pairs from these two databases as reliable homologs and analogs to understand the differences as well as to develop a discriminator between homology and structural analogy.

We first characterize and compare the MALIDUP and MALISAM pairs in terms of structure, sequence, and profile scores. Combining these scores, we train support vector machines (SVMs) to discriminate between the homologs in MALIDUP and the analogs in MALISAM. Since MALIDUP and MALISAM are quite small and may not be representative of the full variety of proteins found in nature, we test the resulting SVM-based classifier on the comprehensive SCOP database. We show that although the classifier is trained on the manually constructed data sets with particular statistical properties, it can recover the majority of distant homologs classified in the same SCOP superfamily but different families. Moreover, the classifier is capable of finding more distantly related pairs between SCOP superfamilies, folds, and classes. We discuss some of these interesting pairs and argue that many of them indeed represent remote homologs.
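To make the idea of combining several similarity scores in an SVM concrete, the following is a minimal Python sketch and not the implementation used in this study: the SVM software, kernel, and feature set are not specified in this excerpt. As a stand-in, it fits a linear soft-margin SVM by stochastic subgradient descent on the hinge loss. The function names, the hyperparameters, and the two-feature toy vectors (imagined as a rescaled profile score and a rescaled structure score) are all assumptions made for illustration, and the numbers are invented rather than drawn from MALIDUP or MALISAM.

```python
import random
from typing import List, Sequence, Tuple


def train_linear_svm(
    features: List[Sequence[float]],
    labels: List[int],  # +1 = homolog, -1 = analog (hypothetical encoding)
    reg: float = 0.01,
    rate: float = 0.05,
    epochs: int = 1000,
    seed: int = 0,
) -> Tuple[List[float], float]:
    """Fit a linear soft-margin SVM by stochastic subgradient descent on the hinge loss."""
    rng = random.Random(seed)
    weights = [0.0] * len(features[0])
    bias = 0.0
    order = list(range(len(features)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = features[i], labels[i]
            margin = y * (sum(w * v for w, v in zip(weights, x)) + bias)
            # L2 regularisation shrinks the weights on every step.
            weights = [w * (1.0 - rate * reg) for w in weights]
            # Examples inside the margin (or misclassified) push the boundary.
            if margin < 1.0:
                weights = [w + rate * y * v for w, v in zip(weights, x)]
                bias += rate * y
    return weights, bias


def predict(weights: Sequence[float], bias: float, x: Sequence[float]) -> int:
    """Return +1 (homolog) or -1 (analog) for one score vector."""
    score = sum(w * v for w, v in zip(weights, x)) + bias
    return 1 if score >= 0.0 else -1


# Invented toy score vectors, each already rescaled to [0, 1].
toy_features = [(0.9, 0.8), (0.8, 0.7), (0.7, 0.9), (0.2, 0.6), (0.1, 0.7), (0.3, 0.5)]
toy_labels = [1, 1, 1, -1, -1, -1]

toy_weights, toy_bias = train_linear_svm(toy_features, toy_labels)
print([predict(toy_weights, toy_bias, x) for x in toy_features])
```

In a linear model of this kind, the learned weights give a rough indication of how much each score contributes to the decision, provided the features are on comparable scales.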

Comparison of homologs and analogs in the manually constructed data sets

To better understand the differences between homology and analogy, we compare the homologous pairs in MALIDUP and the analogous pairs in MALISAM in terms of aligned length, sequence identity, and RMSD of structure superposition (Fig. 1). Apparently, MALIDUP includes more pairs with longer alignments, higher sequence identity, or lower RMSD. To focus on the differences between remote homologs and structural analogs as well as to obtain balanced data sets for developing the classifier, we

Manually constructed data sets

The MALIDUP database contains manual structure-based alignments of 241 homologous pairs, while the MALISAM database contains 130 analogous pairs. As shown in Fig. 1, MALIDUP includes many close homologous pairs whose long aligned length, high sequence identity, or low RMSD is not matched by any analogous pairs in MALISAM. Since we are interested in discriminating remote homologs and analogs, we divide MALIDUP into two parts: 111 close homologous pairs (aligned length above 100 residues,

Acknowledgements

We thank Lisa Kinch for critical reading of the manuscript and helpful discussions. This work was supported by National Institutes of Health grant GM67165 and Welch Foundation grant I-1505 to N.V.G.

References (69)

  • B. Vestergaard et al. Bacterial polypeptide release factor RF2 is structurally distinct from eukaryotic eRF1. Mol. Cell (2001)
  • A.D. McLachlan. Gene duplications in the structural evolution of chymotrypsin. J. Mol. Biol. (1979)
  • S.K. Nair et al. X-ray structures of Myc-Max and Mad-Max recognizing DNA. Molecular bases of regulation by proto-oncogenic transcription factors. Cell (2003)
  • H.H. Low et al. The crystal structure of ZapA and its modulation of FtsZ polymerisation. J. Mol. Biol. (2004)
  • C. Momany et al. Crystallographic structure of a PLP-dependent ornithine decarboxylase from Lactobacillus 30a to 3.0 Å resolution. J. Mol. Biol. (1995)
  • S. Xiang et al. The crystal structure of Escherichia coli MoeA and its relationship to the multifunctional protein gephyrin. Structure (2001)
  • N.V. Grishin. Fold change in evolution of protein structures. J. Struct. Biol. (2001)
  • A.B. Boraston et al. Structure and ligand binding of carbohydrate-binding module CsCBM6-3 reveals similarities with fucose-specific lectins and “galactose-binding” domains. J. Mol. Biol. (2003)
  • N. Nagano et al. One fold with many functions: the evolutionary relationships between TIM barrel families based on their sequences, structures and functions. J. Mol. Biol. (2002)
  • R.R. Copley et al. Homology among (betaalpha)(8) barrels: implications for the evolution of metabolic pathways. J. Mol. Biol. (2000)
  • L. Aravind et al. The many faces of the helix–turn–helix domain: transcription regulation and beyond. FEMS Microbiol. Rev. (2005)
  • C.P. Ponting et al. Beta-propeller repeats and a PDZ domain in the tricorn protease: predicted self-compartmentalisation and C-terminal polypeptide-binding strategies of substrate selection. FEMS Microbiol. Lett. (1999)
  • S. Vijay-Kumar et al. Comparison of the three-dimensional structures of human, yeast, and oat ubiquitin. J. Biol. Chem. (1987)
  • R. Sankaranarayanan et al. The structure of threonyl-tRNA synthetase–tRNA(Thr) complex enlightens its repressor activity and reveals an essential zinc ion in the active site. Cell (1999)
  • A.C. Dock-Bregeon et al. Achieving error-free translation; the mechanism of proofreading of threonyl-tRNA synthetase at atomic resolution. Mol. Cell (2004)
  • M.I. Wilson et al. PB1 domain-mediated heterodimerization in NADPH oxidase and signaling complexes of atypical protein kinase C with Par6 and p62. Mol. Cell (2003)
  • K. Uegaki et al. Structure of the CAD domain of caspase-activated DNase and interaction with the CAD domain of its inhibitor. J. Mol. Biol. (2000)
  • R. Sadreyev et al. COMPASS: a tool for comparison of multiple protein alignments with assessment of statistical significance. J. Mol. Biol. (2003)
  • S.F. Altschul et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. (1997)
  • R.F. Doolittle. Similar amino acid sequences: chance or common ancestry? Science (1981)
  • O. Lichtarge. Getting past appearances: the many-fold consequences of remote homology. Nat. Struct. Biol. (2001)
  • Y. Matsuo et al. Identification of homologous core structures. Proteins (1999)
  • S. Dietmann et al. Identification of homology in protein structure classification. Nat. Struct. Biol. (2001)
  • H. Cheng et al. MALIDUP: a database of manually constructed structure alignments for duplicated domain pairs. Proteins (2007)