Journal of Molecular Biology
Volume 283, Issue 4, 6 November 1998, Pages 707-725

Review
Predicting function: from genes to genomes and back¹

https://doi.org/10.1006/jmbi.1998.2144

Abstract

Predicting function from sequence using computational tools is a highly complicated procedure that is generally done for each gene individually. This review focuses on the added value that completely sequenced genomes provide for function prediction. Various levels of sequence annotation and function prediction are discussed, ranging from genomic sequence to complex cellular processes. Protein function is currently best described in the context of molecular interactions. In the near future it will be possible to predict protein function in the context of higher order processes such as the regulation of gene expression, metabolic pathways and signalling cascades. The analysis of such higher levels of function description draws not only on completely sequenced genomes but also on proteomics and expression data. The final goal will be to elucidate the mapping between genotype and phenotype.

Section snippets

Genomes and function prediction

Prediction of protein function using computational tools is becoming increasingly important as the gap widens between the growing number of sequences and the experimental characterization of the respective proteins (Bork & Koonin, 1998; Smith, 1998). With the availability of complete genomes we face a new quality in the prediction process (Table 1), as context information can be utilized when analysing particular sequences. This review focuses on the added value of genomic information on the many

What is function?

“Function” is a very loosely defined term that only makes sense in context. Most current efforts aim at predicting protein function, but there are other types of function, e.g. RNA function or organelle function, that also need to be explored. Even to describe “protein function” requires a broad range of attributes and features (Figure 1). Molecular features such as enzymatic activity, interaction partners, and pathway context are currently being predicted, but only qualitatively. Expression

Functional prediction for gene products by annotation transfer from homologous sequences

When homologues of a query are identified in a database search (Bork & Gibson, 1996), the annotated information of the homologue and the taxonomic, biochemical and/or molecular-biological context of the query protein are used to extrapolate possible structural and functional features of the query protein. This approach has proven extremely successful, although from a formal point of view the hypotheses generated must be experimentally verified (Eisenhaber et al., 1995). The information transfer
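The transfer step described above can be sketched as follows. This is a minimal illustration, not the procedure of any specific tool: the `Hit` record, the function name and all threshold values are assumptions chosen for the example, and the output is deliberately phrased as a hypothesis because, as noted above, a transferred annotation still requires experimental verification.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Hit:
    """One database hit for a query sequence (hypothetical record layout)."""
    subject_id: str
    annotation: str
    e_value: float
    identity: float  # fraction of identical residues in the alignment, 0..1
    coverage: float  # fraction of the query covered by the alignment, 0..1


def transfer_annotation(
    hits: Iterable[Hit],
    max_e_value: float = 1e-10,   # assumed cutoff, not taken from the review
    min_identity: float = 0.4,    # assumed cutoff
    min_coverage: float = 0.8,    # assumed cutoff
) -> Optional[str]:
    """Return a hedged annotation from the best acceptable hit, or None."""
    candidates = [
        h for h in hits
        if h.e_value <= max_e_value
        and h.identity >= min_identity
        and h.coverage >= min_coverage
    ]
    if not candidates:
        return None  # no hit is trusted enough; leave the query unannotated
    best = min(candidates, key=lambda h: h.e_value)
    # Mark the result as a prediction and record where it came from.
    return f"putative {best.annotation} (by similarity to {best.subject_id})"


if __name__ == "__main__":
    example_hits = [
        Hit("db|X1", "serine protease", 1e-40, 0.55, 0.95),
        Hit("db|X2", "hypothetical protein", 1e-3, 0.25, 0.40),
    ]
    print(transfer_annotation(example_hits))
```

Requiring coverage as well as a significant score is one simple way to avoid transferring the function of a single shared domain to a whole multidomain protein; real pipelines would combine this with the taxonomic and biochemical context mentioned above.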

Sequence and annotation quality in molecular databases

Function transfer by analogy requires knowledge about the quality of sequence data and functional annotation. Concerns have been raised about an accumulation (Bork & Bairoch, 1996) and even an explosion (Bhatia et al., 1997) of errors in sequence databases.

In genome projects, two- to tenfold sequence coverage is usually sampled. This is critical as automated raw data acquisition (single read) is less than 99% accurate even when using optimized sequencers and software (Ewing et al., 1998). Most
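A rough calculation illustrates why coverage matters. The sketch below assumes that errors are independent between reads, that every read has the same per-base error rate, and that the consensus is a simple majority vote in which ties count as failures. The 1% error rate used in the example is only the boundary value implied by the "less than 99% accurate" figure above; none of this models a real base caller or assembler.

```python
from math import comb


def consensus_failure_probability(per_read_error: float, coverage: int) -> float:
    """Probability that correct reads are NOT a strict majority at one position.

    Assumes independent reads with identical error rates (a simplification).
    Ties are counted as failures, which makes the estimate conservative.
    """
    if not 0.0 <= per_read_error <= 1.0:
        raise ValueError("per_read_error must lie in [0, 1]")
    if coverage < 1:
        raise ValueError("coverage must be at least 1")
    p, n = per_read_error, coverage
    # Failure means the number of wrong reads k satisfies 2 * k >= n.
    return sum(
        comb(n, k) * p**k * (1.0 - p) ** (n - k)
        for k in range(n + 1)
        if 2 * k >= n
    )


if __name__ == "__main__":
    for n in (1, 2, 3, 5, 10):
        print(n, consensus_failure_probability(0.01, n))
```

Under these assumptions, twofold coverage does not by itself resolve a disagreement between two reads, whereas odd coverage of three or more lets a single erroneous read be outvoted.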

Annotating genomes

Function prediction usually starts with already assembled genomic or cDNA data: at best a complete genome (Figure 2). Several features intrinsic to DNA can be recognized first, before identification of genes and pathways, although detection of the latter also enhances the annotation of non-coding features in genomes.
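As one simple illustration of a feature that can be computed from the DNA alone, before any gene is predicted, the sketch below measures GC content in sliding windows. The choice of GC content as the example, the function names and the window parameters are assumptions made for illustration; the review does not single out this feature.

```python
from typing import List, Tuple


def gc_content(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T) in seq."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0  # empty or fully ambiguous sequence
    return (seq.count("G") + seq.count("C")) / counted


def windowed_gc(seq: str, window: int, step: int) -> List[Tuple[int, float]]:
    """Return (start, GC fraction) for each full window along the sequence."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    for start, gc in windowed_gc("ATATATGCGCGCGCATATAT", window=5, step=5):
        print(start, round(gc, 2))
```

Regions whose composition departs strongly from the genome average are candidates for closer inspection, but interpreting them requires the gene-level and comparative evidence discussed in the rest of the review.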

Annotating individual proteins

Although homology searches are often already integrated into the gene prediction procedures, they are fully exploited only at the protein level with its higher sensitivity. Database searches are a standard technique for annotating proteins, but should be used in context with other methods (Bork & Koonin, 1998).

Incorporating proteomics data

Proteomics focuses on the protein products of the genome and their interactions rather than on DNA sequences (Humphery-Smith & Blackstock, 1997). It is thus complementary to the genomic and nucleic acid information (Kahn, 1995), exploiting novel tools such as 2D large-scale analysis (Vietor & Huber, 1997) and powerful mass-spectrometry applications (Yates, 1998).

Predicting function in higher order processes

Having predicted or determined functions for as many genes as possible and having assigned their interactions as well as their expression levels, it is a challenging task to put all the information into the context of cellular processes (Figure 1). A variety of databases and tools are emerging to support this procedure.

Robustness, modularity and interdependence

When considering all the levels discussed, there seems to be a discrepancy between the complex nature of the networks of genes and their interdependence (e.g. via regulation) on the one hand and the surprising robustness (e.g. horizontal gene transfer or gene loss) on the other. One way in which such robustness might be achieved is a highly modular organisation; the interdependencies of genes would then be limited to small sets. As yet we do not have a quantitative understanding of the
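The idea that interdependencies are confined to small sets can be made concrete with a toy calculation. The sketch below treats pairwise dependencies as edges of an undirected graph and reports its connected components as "modules". Equating modules with connected components is an assumption made purely for illustration, and the gene names in the example are hypothetical.

```python
from collections import defaultdict
from typing import Hashable, Iterable, List, Set, Tuple


def modules(dependencies: Iterable[Tuple[Hashable, Hashable]]) -> List[Set[Hashable]]:
    """Group genes into connected components of the dependency graph.

    Components are returned largest first. A strongly modular organisation
    would show up as many small components rather than one large one.
    """
    graph = defaultdict(set)
    for a, b in dependencies:
        graph[a].add(b)
        graph[b].add(a)

    seen: Set[Hashable] = set()
    components: List[Set[Hashable]] = []
    for start in list(graph):
        if start in seen:
            continue
        # Depth-first traversal collecting everything reachable from start.
        stack = [start]
        seen.add(start)
        component: Set[Hashable] = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return sorted(components, key=len, reverse=True)


if __name__ == "__main__":
    # Hypothetical regulatory dependencies between made-up genes.
    deps = [("geneA", "geneB"), ("geneB", "geneC"), ("geneD", "geneE")]
    for module in modules(deps):
        print(sorted(module))
```

In such a picture, losing or acquiring a gene perturbs only the module it belongs to, which is one way of reading the robustness argument above.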

Acknowledgements

The authors are listed in alphabetical order. This work was supported by Deutsche Forschungsgemeinschaft (Bo 1099/3-1) and BMBF (grants 01KW9602/6; 0311748; 0311617). We thank Enrique Morrett and Shamil Sunyaev for critical reading of the manuscript and David Thomas for stylistic corrections. Most of all we acknowledge the work and efforts of all our colleagues who could not be mentioned in this review due to limitations of space and time and the limited selection such a review necessarily has to

References (155)

  • T. Doerks et al.

    Protein annotation: detective work for function prediction

    Trends Genet.

    (1998)
  • A.R. Dongre et al.

    Emerging tandem-mass-spectroscopy techniques for the rapid identification of proteins

    Trends Biotechnol.

    (1997)
  • F. Eisenhaber et al.

    Wanted: subcellular localization of proteins based on sequence

    Trends Cell Biol.

    (1998)
  • J.D. Esko et al.

    Influence of core protein sequence on glycosaminoglycan assembly

    Curr. Opin. Struct. Biol.

    (1996)
  • T. Etzold et al.

    SRS: information retrieval system for molecular biology data banks

    Methods Enzymol.

    (1996)
  • T. Gaasterland et al.

    Fully automated genome analysis that reflects user needs and preferences. A detailed introduction to the MAGPIE system architecture

    Biochimie

    (1996)
  • L.M. Gelbert et al.

    Will genetics really revolutionize the drug discovery process?

    Curr. Opin. Biotechnol.

    (1997)
  • R. Guigo

    Computational gene identification: an open problem

    Comput. Chem.

    (1997)
  • M.A. Huynen et al.

    Genomics: differential genome analysis applied to the species specific features of Helicobacter pylori

    FEBS Letters

    (1998)
  • M.A. Huynen et al.

    Homology-based fold predictions for Mycoplasma genitalium proteins

    J. Mol. Biol.

    (1998)
  • J. Jurka et al.

    CENSOR: a program for identification and elimination of repetitive elements from DNA sequences

    Comput. Chem.

    (1996)
  • E.V. Koonin et al.

    Prokaryotic genomes: the emerging paradigm of genome-based microbiology

    Curr. Opin. Genet. Dev.

    (1997)
  • B. Küster et al.

    Identifying proteins and post-translational modifications by mass spectrometry

    Curr. Opin. Struct. Biol.

    (1998)
  • A. Lupas

    Predicting coiled coil regions in proteins

    Curr. Opin. Struct. Biol.

    (1997)
  • C. Medigue et al.

    Evidence for horizontal transfer in Escherichia coli speciation

    J. Mol. Biol.

    (1991)
  • A.J. Mighell et al.

    Alu sequences

    FEBS Letters

    (1997)
  • A.R. Mushegian et al.

    Gene order is not conserved in bacterial evolution

    Trends Genet.

    (1996)
  • S.F. Altschul et al.

    Gapped Blast and PSI-Blast, a new generation of protein database search programs

    Nucl. Acids Res.

    (1997)
  • L. Anderson et al.

    A comparison of selected mRNA and protein abundances in human liver

    Electrophoresis

    (1997)
  • M. Andrade et al.

    Automatic annotation for biological sequences by extraction of keywords from MEDLINE abstracts

    Intelligent Systems Mol. Biol.

    (1997)
  • M. Andrade et al.

    Sequence analysis of the Methanococcus jannaschii genome and the prediction of protein function

    Comput. Appl. Biosci.

    (1997)
  • R.S. Annan et al.

    The essential role of mass spectroscopy in characterization of protein structure: mapping post-translational modifications

    J. Protein Chem.

    (1997)
  • I.T. Arkin et al.

    Are there dominant membrane protein families with a given number of helices?

    Proteins: Funct. Genet.

    (1997)
  • T.K. Attwood et al.

    The PRINTS protein fingerprint database in its fifth year

    Nucl. Acids Res.

    (1998)
  • Y. d’Aubenton-Carafa et al.

    Prediction of rho-independent Escherichia coli transcription terminators. A statistical analysis of their RNA stem-loop structures

    J. Mol. Biol.

    (1990)
  • S. Audic et al.

    Self-identification of protein-coding regions in microbial genomes

    Proc. Natl Acad. Sci. USA

    (1998)
  • A. Bairoch et al.

    The SWISS-PROT protein sequence databank and its supplement TrEMBL

    Nucl. Acids Res.

    (1998)
  • A. Bairoch et al.

    The PROSITE database, its status and progress

    Nucl. Acids Res.

    (1997)
  • U. Bhatia et al.

    Dealing with database explosion: a cautionary note

    Science

    (1997)
  • F.R. Blattner et al.

    The complete genome sequence of Escherichia coli K-12

    Science

    (1997)
  • M.S. Boguski et al.

    Gene discovery in dbEST

    Science

    (1994)
  • P. Bork et al.

    Predicting function from protein sequence: where are the bottlenecks?

    Nature Genet.

    (1998)
  • C.J. Bult et al.

    Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii

    Science

    (1996)
  • V. Brendel et al.

    Terminators of transcription with RNA polymerase from Escherichia coli: what they look like and how to find them

    J. Biomol. Struct. Dynam.

    (1986)
  • R. Brent et al.

    Understanding gene and allele function with two-hybrid methods

    Annu. Rev. Genet.

    (1997)
  • L. Chan et al.

    A computer method for finding common base paired helices in aligned sequences: application to the analysis of random sequences

    Nucl. Acids Res.

    (1990)
  • E. Chargaff et al.

    The Nucleic Acids

    (1955)
  • P.D. Chastian et al.

    CTG repeats associated with human genetic disease are inherently flexible

    J. Mol. Biol.

    (1998)
  • M. Chee et al.

    Accessing genetic information with high-density DNA arrays

    Science

    (1996)
  • P. Colas et al.

    Genetic selection of peptide aptamers that recognize and inhibit cyclin-dependent kinase 2

    Nature

    (1997)

¹ Edited by P. E. Wright

² Present address: Y. Diaz-Lazcoz, Laboratoire Genome et Informatique; Batiment BUFFON, Universite de Versailles-Saint Quentin, 45, avenue des Etats-Unis, 78035 Versailles Cedex, France.