Journal of Molecular Biology
Discrimination between Distant Homologs and Structural Analogs: Lessons from Manually Constructed, Reliable Data Sets
Introduction
Three-dimensional structural similarities among proteins are explained by either divergence or convergence. In divergent evolution, homologs inherit similar structures from their common ancestor. In convergent evolution, proteins from distinct evolutionary lineages arrive independently at similar structures because there are only a limited number of energetically favorable ways to pack secondary structural elements (SSEs) [1, 2, 3]; such proteins are called analogs. Judging whether two structurally similar proteins are homologous or analogous remains a difficult task. Statistically significant sequence similarity, as detected by sequence search tools such as PSI-BLAST [4], is generally accepted as adequate evidence for homology [5, 6]. In the absence of significant sequence similarity, remote homology inference can be based on overall structural similarity, augmented by other properties such as similar arrangements of functional residues, common ligand-binding modes, shared unusual structural features, and similar domain organizations [7, 8]. However, capturing these features often requires visual inspection by human experts and remains more art than science.
The Structural Classification of Proteins (SCOP) database [9] represents a comprehensive collection of manually curated homologous superfamilies of protein domains with known structures. In the SCOP hierarchy, domains with significant sequence similarity or overwhelming structural and functional similarity (close homologs) are grouped into the same family; families with convincing structural and/or functional evidence for common ancestry are grouped into the same superfamily; superfamilies with the same overall three-dimensional structure and topology but without very strong evidence for homology are grouped into the same fold; and folds are grouped into classes based on their SSE compositions. SCOP is manually maintained by human experts, and its superfamily level is regarded as the most reliable standard for remote homologs [10].
Several efforts have been made to discern the boundary between homology and analogy in an automated and quantitative way. Russell et al. [11] statistically analyzed structurally aligned homologous and analogous pairs and found that, compared with analogs, homologs generally retain higher sequence identity and more strongly conserved SSEs and solvent accessibility. Matsuo and Bryant [12] defined a homologous core structure representing the consensus substructure in a protein family, and used the overlap of homologous core structures to distinguish homologs from analogs. Dietmann and Holm [13] trained a neural network to discriminate between homologs and analogs based on sequence, structure, and functional similarities. All three studies used domains in the same SCOP superfamily as homologs and domains in different SCOP superfamilies as analogs in their analysis. Given the conservative nature of the SCOP hierarchy, a potential flaw of this approach is the contamination of the analog data set by homologs. Domains in different superfamilies are not necessarily analogs and may in fact prove to be homologous when new evidence emerges [9]. For instance, through careful analysis, Ponting and Russell [14] suggested that at least five SCOP superfamilies under the β-trefoil fold were actually homologous and had descended from a common ancestor.
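The superfamily-based labelling convention described above can be sketched in a few lines. This is only an illustration, assuming domains are identified by SCOP concise classification strings of the form class.fold.superfamily.family; the function names and the example identifiers are hypothetical and are not taken from any of the cited studies.

```python
from typing import Optional

# Order of the four components of a SCOP classification string.
SCOP_LEVELS = ("class", "fold", "superfamily", "family")


def deepest_shared_level(sccs_a: str, sccs_b: str) -> Optional[str]:
    """Return the deepest SCOP level at which two domains agree, or None."""
    shared = None
    for level, part_a, part_b in zip(SCOP_LEVELS, sccs_a.split("."), sccs_b.split(".")):
        if part_a != part_b:
            break
        shared = level
    return shared


def label_by_superfamily(sccs_a: str, sccs_b: str) -> str:
    """Label a pair the way the earlier studies did (assumed rule):
    same superfamily -> homolog, anything else -> presumed analog."""
    if deepest_shared_level(sccs_a, sccs_b) in ("superfamily", "family"):
        return "homolog"
    return "presumed analog"


# Illustrative identifiers only: same fold, different superfamilies.
print(deepest_shared_level("b.42.1.1", "b.42.2.1"))
print(label_by_superfamily("b.42.1.1", "b.42.2.1"))
```

Under this rule, two domains that share a fold but sit in different superfamilies are labelled as presumed analogs even if they later turn out to be homologous, which is exactly the contamination risk noted above.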
To avoid this ambiguity, we approach the problem of discriminating between homologs and analogs with clearer and more reliable data sets. Previously, we manually constructed a homolog database (MALIDUP [15]) composed of duplicated domain pairs and an analog database (MALISAM [16]) composed of three categories of analogous pairs (a hybrid motif and a core motif, an interface motif and a core motif, and an artificial protein and a natural protein). Each pair in MALIDUP or MALISAM is carefully inspected to convincingly support homology or analogy and then manually superimposed and aligned to ensure good alignment quality. In this study, we use pairs from these two databases as reliable homologs and analogs, both to understand the differences between homology and structural analogy and to develop a discriminator between them.
We first characterize and compare the MALIDUP and MALISAM pairs in terms of structure, sequence, and profile scores. Combining these scores, we train support vector machines (SVMs) to discriminate between the homologs in MALIDUP and the analogs in MALISAM. Since MALIDUP and MALISAM are quite small and may not be representative of the full variety of proteins found in nature, we test the resulting SVM-based classifier on the comprehensive SCOP database. We show that although the classifier is trained on the manually constructed data sets with particular statistical properties, it can recover the majority of distant homologs classified in the same SCOP superfamily but different families. Moreover, the classifier is capable of finding more distantly related pairs between SCOP superfamilies, folds, and classes. We discuss some of these interesting pairs and argue that many of them indeed represent remote homologs.
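To make the idea of combining per-pair scores in an SVM concrete, here is a minimal sketch. It rests on several assumptions: it uses only two toy features per pair (sequence identity and RMSD), the numbers are invented for illustration and are not MALIDUP or MALISAM data, and it fits a linear soft-margin SVM without an intercept (the features are centred first) by plain batch sub-gradient descent. The kernel, feature set, and training procedure of the actual classifier are not specified in this section.

```python
from statistics import fmean, pstdev

# Invented toy pairs: (sequence identity fraction, RMSD in angstroms).
TOY_PAIRS = [(0.25, 2.0), (0.20, 2.5), (0.30, 1.8), (0.22, 2.2),   # "homologs"
             (0.08, 3.5), (0.10, 3.0), (0.06, 4.0), (0.12, 3.2)]   # "analogs"
TOY_LABELS = [1, 1, 1, 1, -1, -1, -1, -1]  # +1 = homolog, -1 = analog


def standardize(rows):
    """Z-score each feature column; also return (mean, std) per column."""
    stats = [(fmean(col), pstdev(col)) for col in zip(*rows)]
    scaled = [[(v - m) / s for v, (m, s) in zip(row, stats)] for row in rows]
    return scaled, stats


def train_linear_svm(rows, labels, lam=0.01, lr=0.1, epochs=500):
    """Minimise lam/2 * |w|^2 + mean hinge loss by batch sub-gradient descent.
    No intercept term: the inputs are assumed to be centred."""
    n = len(rows)
    w = [0.0] * len(rows[0])
    for _ in range(epochs):
        grad = [lam * wj for wj in w]
        for x, y in zip(rows, labels):
            if y * sum(wj * xj for wj, xj in zip(w, x)) < 1:  # margin violated
                grad = [g - y * xj / n for g, xj in zip(grad, x)]
        w = [wj - lr * g for wj, g in zip(w, grad)]
    return w


def classify(pair, w, stats):
    """Standardize a new pair with the training statistics and apply the sign rule."""
    z = [(v - m) / s for v, (m, s) in zip(pair, stats)]
    return "homolog" if sum(wj * zj for wj, zj in zip(w, z)) > 0 else "analog"


scaled_pairs, feature_stats = standardize(TOY_PAIRS)
weights = train_linear_svm(scaled_pairs, TOY_LABELS)
print(classify((0.28, 1.9), weights, feature_stats))
```

On this toy data the learned weight is positive for sequence identity and negative for RMSD, so a pair with higher identity and lower RMSD is pushed toward the homolog side. A real application would use more features, held-out evaluation, and most likely an established SVM library.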
Comparison of homologs and analogs in the manually constructed data sets
To better understand the differences between homology and analogy, we compare the homologous pairs in MALIDUP and the analogous pairs in MALISAM in terms of aligned length, sequence identity, and RMSD of structure superposition (Fig. 1). Apparently, MALIDUP includes more pairs with longer alignments, higher sequence identity, or lower RMSD. To focus on the differences between remote homologs and structural analogs as well as to obtain balanced data sets for developing the classifier, we
Manually constructed data sets
The MALIDUP database contains manual structure-based alignments of 241 homologous pairs, while the MALISAM database contains 130 analogous pairs. As shown in Fig. 1, MALIDUP includes many close homologous pairs whose long aligned length, high sequence identity, or low RMSD is not matched by any analogous pairs in MALISAM. Since we are interested in discriminating between remote homologs and analogs, we divide MALIDUP into two parts: 111 close homologous pairs (aligned length above 100 residues,
Acknowledgements
We thank Lisa Kinch for critical reading of the manuscript and helpful discussions. This work was supported by National Institutes of Health grant GM67165 and Welch Foundation grant I-1505 to N.V.G.
References (69)
- et al. Why do globular proteins fit the limited set of folding patterns? Prog. Biophys. Mol. Biol. (1987)
- et al. Structurally analogous proteins do exist! Structure (London) (2004)
- et al. Review: what can structural classifications reveal about protein evolution? J. Struct. Biol. (2001)
- Similar amino acid sequences revisited. Trends Biochem. Sci. (1989)
- et al. Evolution of protein structures and functions. Curr. Opin. Struct. Biol. (2002)
- How far divergent evolution goes in proteins. Curr. Opin. Struct. Biol. (1998)
- et al. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. (1995)
- et al. Recognition of analogous and homologous protein folds: analysis of sequence and structure conservation. J. Mol. Biol. (1997)
- et al. Identification of distant homologues of fibroblast growth factors suggests a common ancestor for all beta-trefoil proteins. J. Mol. Biol. (2000)
- et al. Protein structure comparison by alignment of distance matrices. J. Mol. Biol. (1993)