Prediction and Functional Analysis of Native Disorder in Proteins from the Three Kingdoms of Life

https://doi.org/10.1016/j.jmb.2004.02.002

Abstract

An automatic method for recognizing natively disordered regions from amino acid sequence is described and benchmarked against predictors that were assessed at the latest critical assessment of techniques for protein structure prediction (CASP) experiment. The method attains a Wilcoxon score of 90.0, which represents a statistically significant improvement on the methods evaluated on the same targets at CASP. The classifier, DISOPRED2, was used to estimate the frequency of native disorder in several representative genomes from the three kingdoms of life. Putative, long (>30 residue) disordered segments are found to occur in 2.0% of archaean, 4.2% of eubacterial and 33.0% of eukaryotic proteins. The function of proteins with long predicted regions of disorder was investigated using the gene ontology annotations supplied with the Saccharomyces genome database. The analysis of the yeast proteome suggests that proteins containing disorder are often located in the cell nucleus and are involved in the regulation of transcription and cell signalling. The results also indicate that native disorder is associated with the molecular functions of kinase activity and nucleic acid binding.

Introduction

One of the central tenets of structural biology is that the function of a protein is determined by its three-dimensional structure. As a result, predicting protein structure has often been at the forefront of efforts to infer function. However, it appears that a large proportion of protein sequences do not form complete globular structures. The natively disordered regions within these proteins may adopt an ensemble of structural states, with transitions between the states leading to dynamic flexibility of the protein structure [1], or have non-globular structures that are extended in the solvent [2].

It has been shown experimentally that disordered regions are involved in DNA binding [3] and several other types of molecular recognition. One of the advantages of disordered binding sites is that their multiple metastable conformations allow them to recognize several targets with high specificity and low affinity [4]. Transitions between the native unfolded state and a globular structure, induced by phosphorylation or some other type of interaction, may also provide thermodynamic regulation of binding. The prediction of disordered regions would therefore provide a first step in methods for identifying functionally relevant disordered regions and the flexible segments that hinder successful crystallization of the protein.

It has been shown in a series of papers [5, 6, 7, 8] that there are clear patterns that characterize disordered regions, such as low sequence complexity, amino acid compositional bias (e.g. away from aromatic residues) and high flexibility, and that disorder can be predicted successfully from amino acid sequence. We describe here the development of a new method for predicting native disorder. The classifier, DISOPRED2, is benchmarked on targets from the last critical assessment of structure prediction (CASP) experiment, which included an evaluation of the latest disorder prediction methods [9].

The new method is also used to investigate disorder in several archaeal, eubacterial and eukaryotic genomes. Previous genome-wide analyses of disordered regions have been based on classifiers with high false positive rates (16% for disordered segments longer than 40 residues) [10, 11]. Although the results presented here cannot be interpreted as a lower bound on the proportion of proteins that contain disorder, they are intended to be very conservative, with false positive rates estimated to be lower than 0.5% on long disordered segments.
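As an illustration of what a "long disordered segment" means operationally, the following minimal sketch scans a per-residue disorder labelling for uninterrupted runs above a length cutoff. It is not the DISOPRED2 implementation. The function name `long_disordered_segments`, the boolean input format and the default cutoff of 31 residues (taken from the ">30 residue" definition in the abstract) are assumptions made for this example.

```python
from itertools import groupby


def long_disordered_segments(labels, min_length=31):
    """Find long runs of disordered residues.

    labels: sequence of truthy/falsy values, one per residue
            (truthy = predicted disordered). Hypothetical input format.
    min_length: smallest run length reported; 31 mirrors ">30 residues".
    Returns a list of (start, end) pairs, 0-based with end exclusive.
    """
    segments = []
    position = 0
    # groupby yields maximal runs of consecutive equal labels
    for is_disordered, run in groupby(labels):
        run_length = len(list(run))
        if is_disordered and run_length >= min_length:
            segments.append((position, position + run_length))
        position += run_length
    return segments
```

Under these assumptions, a protein would count as containing long disorder when the returned list is non-empty; how a false positive is scored on a set of ordered proteins is a separate choice that this sketch does not make.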

The functions of potentially disordered proteins are also investigated using the gene ontology (GO) annotations [12] for the budding yeast Saccharomyces cerevisiae. The aim of the analysis was to investigate which processes rely directly on dynamic flexibility of the protein structure. This was achieved by mapping each long disordered segment to the GO annotations attached to its parent protein. The frequency with which each GO term occurred was then compared to its frequency of occurrence in random simulations. The random model corresponds to a null hypothesis whereby each protein's probability of containing a long disordered segment is proportional to its length. The replicates were then used to provide confidence estimates, under the null model, for GO terms that were over- or under-represented in the set of disorder predictions.
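The random simulation described above can be sketched roughly as follows. This is one possible reading of the null model, not the authors' code: each replicate redistributes the observed number of long segments over proteins, drawing each parent protein with probability proportional to its length (with replacement, which is an assumption). The function names, the dictionary-based inputs and the add-one empirical p-value are all hypothetical choices made for this example.

```python
import random
from collections import Counter


def term_counts(segment_parents, annotations):
    """Count, per GO term, the segments whose parent protein carries it.

    segment_parents: one protein identifier per long disordered segment.
    annotations: dict mapping protein identifier -> set of GO terms.
    """
    counts = Counter()
    for protein in segment_parents:
        counts.update(annotations[protein])
    return counts


def simulate_null(lengths, annotations, n_segments, n_replicates, seed=0):
    """Generate GO term counts under the length-proportional null model.

    lengths: dict mapping protein identifier -> sequence length.
    Each replicate places n_segments segments, choosing parent proteins
    with probability proportional to length (assumed: with replacement).
    """
    rng = random.Random(seed)
    proteins = list(lengths)
    weights = [lengths[p] for p in proteins]
    replicates = []
    for _ in range(n_replicates):
        parents = rng.choices(proteins, weights=weights, k=n_segments)
        replicates.append(term_counts(parents, annotations))
    return replicates


def empirical_p_over(term, observed, replicates):
    """Empirical p-value for over-representation of a GO term.

    Fraction of null replicates whose count is at least the observed
    count, with an add-one correction so the estimate is never zero.
    """
    hits = sum(1 for counts in replicates if counts[term] >= observed[term])
    return (hits + 1) / (len(replicates) + 1)
```

Under-representation could be assessed in the same way by counting replicates whose count is at most the observed one. The number of replicates bounds the smallest p-value that can be reported, so it would need to be chosen with the number of GO terms tested in mind.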

Section snippets

Estimating error rates

The false positive rate for DISOPRED2 was established by classifying a set of 7169 ordered proteins with less than 95% sequence similarity to each other. (All residues for the protein set have atomic co-ordinates recorded in the Protein Data Bank (PDB) [13].) This threshold allows inclusion of a large proportion of the PDB but removes multiple models of the same structure or very close homologues. Although DISOPRED2 was developed with the aim of optimizing per residue accuracy, it is important to

Discussion

The difficulty in investigating dynamically flexible polypeptide sequences is the main reason for the relative paucity of experimental data on native disorder compared with globular structures. This difficulty also extends to the identification of disordered regions for the purposes of pattern recognition. The definition of native disorder is also fairly heterogeneous as it applies to global structures such as collapsed molten globule proteins and extended random coil-like proteins, and to the

Recognition of native disorder

The training set for DISOPRED2 was the same as that used to train the original version of DISOPRED [28] and was composed of non-redundant chains with X-ray structures in the PDB [13] and less than 25% pair-wise sequence identity. Only structures with resolutions better than 2.0 Å were used to ensure that missing regions were not caused by poor model quality. Disordered residues were identified by aligning the sequence of the protein chain in the SEQRES records with the sequence as specified by the

Acknowledgements

This work was supported by the Medical Research Council (J.J.W. and J.S.S.). Thanks to Stefano A. Street for assistance with the distributed computing, and David Corney and Kevin Bryson for useful discussions.

References (37)

  • A.K. Dunker et al. Intrinsic disorder and protein function. Biochemistry (2002)
  • P. Romero et al. Identifying disordered regions in proteins from amino acid sequences. Proc. IEEE Int. Conf. Neural Netw. (1997)
  • X. Li et al. Predicting protein disorder for N-, C-, and internal regions. Genome Inform. (1999)
  • P. Romero et al. Sequence complexity and disordered proteins. Proteins: Struct. Funct. Genet. (2001)
  • A. Dunker et al. The protein trinity—linking function and disorder. Nature Biotechnol. (2001)
  • E. Melamud et al. Evaluation of disorder predictions in CASP5. Proteins: Struct. Funct. Genet. (2003)
  • S. Vucetic et al. Flavors of protein disorder. Proteins: Struct. Funct. Genet. (2003)
  • A.K. Dunker et al. Intrinsic protein disorder in complete genomes. Genome Inform. (2000)