Journal of Molecular Biology
Prediction and Functional Analysis of Native Disorder in Proteins from the Three Kingdoms of Life
Introduction
One of the central tenets of structural biology is that the function of a protein is determined by its three-dimensional structure. As a result, predicting protein structure has often been at the forefront of efforts to infer function. However, it appears that a large proportion of protein sequences do not form complete globular structures. The natively disordered regions within these proteins may adopt an ensemble of structural states with transitions between the states leading to dynamic flexibility of the protein structure1 or have non-globular structures that are extended in the solvent.2
It has been shown experimentally that disordered regions are involved in DNA-binding3 and several other types of molecular recognition. One of the advantages of disordered binding sites is that their multiple metastable conformations allow them to recognize several targets with high specificity and low affinity.4 Transitions between the native unfolded state and a globular structure, induced by phosphorylation or some other type of interaction, may also provide thermodynamic regulation of binding. The prediction of disordered regions would therefore provide a first step in methods for identifying functionally relevant disordered regions and the flexible segments that hinder successful crystallization of the protein.
It has been shown in a series of papers5,6,7,8 that clear patterns characterize disordered regions, such as low sequence complexity, amino acid compositional bias (e.g. away from aromatic residues) and high flexibility, and that disorder can be predicted successfully from amino acid sequence. We describe here the development of a new method for predicting native disorder. The classifier, DISOPRED2, is benchmarked on targets from the last critical assessment of structure prediction (CASP) experiment, which included an evaluation of the latest disorder prediction methods.9
The new method is also used to investigate disorder in several archaeal, eubacterial and eukaryotic genomes. Previous genome-wide analyses of disordered regions have been based on classifiers with high false positive rates (16% for disordered segments longer than 40 residues).10,11 Although the results presented here cannot be interpreted as a lower bound on the proportion of proteins that contain disorder, they are intended to be very conservative, with false positive rates estimated to be lower than 0.5% on long disordered segments.
The functions of potentially disordered proteins are also investigated using the gene ontology (GO) annotations12 for the budding yeast Saccharomyces cerevisiae. The aim of the analysis was to investigate which processes rely directly on dynamic flexibility of the protein structure. This was achieved by mapping each long disordered segment to the GO annotations attached to its parent protein. The frequency with which each GO term occurred was then compared to its frequency of occurrence in random simulations. The random model corresponds to a null hypothesis whereby each protein's probability of containing a long disordered segment is proportional to its length. The replicates were then used to provide confidence estimates, under the null model, for GO terms that were over- or under-represented in the set of disorder predictions.
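The randomization test described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy protein and GO labels, the number of replicates and the choice to sample parent proteins with replacement are all assumptions made for illustration. Each replicate draws as many parent proteins as there are observed long disordered segments, with probability proportional to protein length, and the resulting GO term counts are compared with the observed counts to give an empirical estimate of over-representation.

```python
import random
from collections import Counter


def go_term_counts(proteins, go_annotations):
    """Count GO terms over a list of parent proteins (one entry per segment)."""
    counts = Counter()
    for protein in proteins:
        counts.update(go_annotations.get(protein, ()))
    return counts


def over_representation_p_values(segment_parents, lengths, go_annotations,
                                 n_replicates=1000, seed=0):
    """Empirical p-values for GO terms being over-represented.

    Null model (as described in the text): the chance that a protein holds a
    long disordered segment is proportional to its length. Sampling with
    replacement is an assumption of this sketch.
    """
    rng = random.Random(seed)
    names = list(lengths)
    weights = [lengths[name] for name in names]
    observed = go_term_counts(segment_parents, go_annotations)

    # For each term, count replicates whose simulated count reaches the observed one.
    at_least_observed = Counter()
    for _ in range(n_replicates):
        sample = rng.choices(names, weights=weights, k=len(segment_parents))
        simulated = go_term_counts(sample, go_annotations)
        for term, observed_count in observed.items():
            if simulated[term] >= observed_count:
                at_least_observed[term] += 1

    # Add-one correction keeps the estimate strictly above zero.
    return {term: (at_least_observed[term] + 1) / (n_replicates + 1)
            for term in observed}


# Hypothetical toy data: four equally long proteins, one carrying "term_a".
toy_lengths = {"P1": 100, "P2": 100, "P3": 100, "P4": 100}
toy_go = {"P1": {"term_a"}, "P2": {"term_b"}, "P3": {"term_b"}, "P4": {"term_b"}}
toy_segments = ["P1"] * 5  # five long disordered segments, all in P1

p_values = over_representation_p_values(toy_segments, toy_lengths, toy_go)
```

Under-representation could be estimated in the same way by counting replicates in which the simulated count is less than or equal to the observed count.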
Section snippets
Estimating error rates
The false positive rate for DISOPRED2 was established by classifying a set of 7169 ordered proteins with less than 95% sequence similarity to each other. (All residues for the protein set have atomic co-ordinates recorded in the Protein Data Bank (PDB).13) This threshold allows inclusion of a large proportion of the PDB but removes multiple models of the same structure or very close homologues. Although DISOPRED2 was developed with the aim of optimizing per residue accuracy, it is important to
Discussion
The difficulty in investigating dynamically flexible polypeptide sequences is the main reason for the relative paucity of experimental data on native disorder compared with globular structures. This difficulty also extends to the identification of disordered regions for the purposes of pattern recognition. The definition of native disorder is also fairly heterogeneous as it applies to global structures such as collapsed molten globule proteins and extended random coil-like proteins, and to the
Recognition of native disorder
The training set for DISOPRED2 was the same as that used to train the original version of DISOPRED28 and was composed of non-redundant chains with X-ray structures in the PDB13 and less than 25% pair-wise sequence identity. Only structures with resolutions better than 2.0 Å were used to ensure that missing regions were not caused by poor model quality. Disordered residues were identified by aligning the sequence of the protein chain in the SEQRES records with the sequence as specified by the
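As a rough illustration of this kind of labelling, the sketch below aligns a SEQRES sequence against a second, shorter sequence and marks the SEQRES positions that have no counterpart. It assumes that the second sequence consists of the residues with observed coordinates, which the truncated text above does not confirm, and it uses a plain longest-common-subsequence alignment as a stand-in for whatever alignment procedure the authors used. The function name and the example sequences are hypothetical.

```python
def disorder_mask(seqres: str, observed: str) -> list:
    """Return one flag per SEQRES residue: True if it has no aligned
    counterpart in `observed` (treated here as disordered)."""
    n, m = len(seqres), len(observed)

    # lcs[i][j] = length of the longest common subsequence of
    # seqres[i:] and observed[j:], filled from the end backwards.
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if seqres[i] == observed[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])

    # Trace one optimal alignment and clear the flag for matched residues.
    mask = [True] * n
    i = j = 0
    while i < n and j < m:
        if seqres[i] == observed[j]:
            mask[i] = False
            i += 1
            j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            i += 1  # SEQRES residue with no observed counterpart
        else:
            j += 1  # observed residue absent from SEQRES (unexpected)
    return mask


# Hypothetical example: three internal residues lack a counterpart.
example_mask = disorder_mask("MKTWYIAGQR", "MKTWGQR")
example_string = "".join("D" if flag else "-" for flag in example_mask)
print(example_string)  # ----DDD---
```

Where residues repeat, more than one optimal alignment can exist, so the placement of a missing stretch may be ambiguous; matching on residue numbering rather than sequence alone would avoid this.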
Acknowledgements
This work was supported by the Medical Research Council (J.J.W. and J.S.S.). Thanks to Stefano A. Street for assistance with the distributed computing, and David Corney and Kevin Bryson for useful discussions.
References (37)
- et al. Intrinsically unstructured proteins: re-assessing the protein structure–function paradigm. J. Mol. Biol. (1999)
- et al. Coupling of folding and binding for unstructured proteins. Curr. Opin. Struct. Biol. (2002)
- et al. Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol. (1993)
- et al. Intrinsic disorder in cell-signalling and cancer-associated proteins. J. Mol. Biol. (2002)
- et al. Regulation of the cell cycle at the G1-S transition by proteolysis of cyclin E and p27Kip1. Biochem. Biophys. Res. Commun. (2001)
- et al. Epigenetic gene silencing in cancer initiation and progression. Cancer Letters (2003)
- et al. The relationship between structure and function: a comprehensive survey with application to the yeast genome. J. Mol. Biol. (1999)
- et al. Getting the most from PSI-BLAST. Trends Biochem. Sci. (2002)
- et al. The C-terminal half of the anti-sigma factor FlgM contains a dynamic equilibrium solution structure favoring helical conformations. Biochemistry (1998)
- et al. Folding transition in the DNA-binding domain of GCN4 on specific binding to DNA. Nature (1990)
- Intrinsic disorder and protein function. Biochemistry
- Identifying disordered regions in proteins from amino acid sequences. Proc. IEEE Int. Conf. Neural Netw.
- Predicting protein disorder for N-, C-, and internal regions. Genome Inform.
- Sequence complexity and disordered proteins. Proteins: Struct. Funct. Genet.
- The protein trinity - linking function and disorder. Nature Biotechnol.
- Evaluation of disorder predictions in CASP5. Proteins: Struct. Funct. Genet.
- Flavors of protein disorder. Proteins: Struct. Funct. Genet.
- Intrinsic protein disorder in complete genomes. Genome Inform.