Predicting function: from genes to genomes and back

doi:10.1006/jmbi.1998.2144

Journal of Molecular Biology

Volume 283, Issue 4, 6 November 1998, Pages 707-725


Abstract

Predicting function from sequence using computational tools is a highly complicated procedure that is generally done for each gene individually. This review focuses on the added value that is provided by completely sequenced genomes in function prediction. Various levels of sequence annotation and function prediction are discussed, ranging from the genomic sequence to complex cellular processes. Protein function is currently best described in the context of molecular interactions. In the near future it will be possible to predict protein function in the context of higher order processes such as the regulation of gene expression, metabolic pathways and signalling cascades. The analysis of such higher levels of function description draws not only on the information from completely sequenced genomes but also on additional information from proteomics and expression data. The final goal will be to elucidate the mapping between genotype and phenotype.

Section snippets

Genomes and function prediction

Prediction of protein function using computational tools becomes increasingly important as the gap widens between the growing number of sequences and the experimental characterization of the respective proteins (Bork & Koonin, 1998; Smith, 1998). With the availability of complete genomes we face a new quality in the prediction process (Table 1) as context information can be utilized when analysing particular sequences. This review focuses on the added value of genomic information on the many

What is function?

“Function” is a very loosely defined term that only makes sense in context. Most current efforts aim at predicting protein function, but there are other types of function, e.g. RNA function or organelle function, that also need to be explored. Even describing “protein function” requires a broad range of attributes and features (Figure 1). Molecular features such as enzymatic activity, interaction partners, and pathway context are currently being predicted, but only qualitatively. Expression

Functional prediction for gene products by annotation transfer from homologous sequences

When homologues of a query are identified in a database search (Bork & Gibson, 1996), the annotated information of the homologue and the taxonomic, biochemical and/or molecular-biological context of the query protein are used to extrapolate possible structural and functional features of the query protein. This approach has proven extremely successful although, from a formal point of view, the hypotheses generated must be experimentally verified (Eisenhaber et al., 1995). The information transfer
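The transfer step described in this section can be illustrated with a minimal Python sketch. It is not the authors' procedure: the `Hit` record, the `transfer_annotation` helper, and the E-value and coverage cut-offs are all hypothetical choices made for illustration. The output is worded as a hypothesis ("putative ...") because, as noted above, transferred annotations still need experimental verification.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    """One database-search hit for a query protein (hypothetical record)."""
    subject_id: str        # identifier of the homologous database entry
    annotation: str        # functional description attached to that entry
    e_value: float         # expectation value reported by the search
    query_coverage: float  # fraction of the query covered by the alignment (0..1)


def transfer_annotation(
    hits: List[Hit],
    max_e_value: float = 1e-10,   # assumed cut-off, not taken from the review
    min_coverage: float = 0.7,    # assumed cut-off, not taken from the review
) -> Optional[str]:
    """Return a hedged annotation from the most significant acceptable hit.

    Returns None when no hit passes both cut-offs, so that weak
    similarities do not propagate annotation.
    """
    candidates = [
        h for h in hits
        if h.e_value <= max_e_value and h.query_coverage >= min_coverage
    ]
    if not candidates:
        return None
    best = min(candidates, key=lambda h: h.e_value)
    return f"putative {best.annotation} (by similarity to {best.subject_id})"


# Usage with made-up hits: only the first passes the E-value cut-off.
example_hits = [
    Hit("P1", "serine protease", 1e-50, 0.9),
    Hit("P2", "kinase", 1e-3, 0.95),
]
print(transfer_annotation(example_hits))
```

Requiring a minimum alignment coverage is one simple guard against transferring a function on the basis of a single shared domain; a real pipeline would also weigh the taxonomic and biochemical context mentioned in the text.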

Sequence and annotation quality in molecular databases

Function transfer by analogy requires knowledge about the quality of sequence data and functional annotation. Concerns have been raised about an accumulation (Bork & Bairoch, 1996) and even an explosion (Bhatia et al., 1997) of errors in sequence databases.

In genome projects, two- to tenfold sequence coverage is usually sampled. This is critical as automated raw data acquisition (single read) is less than 99% accurate even when using optimized sequencers and software (Ewing et al., 1998). Most
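Why coverage matters can be seen from a small worked calculation, sketched here in Python. It rests on strong simplifying assumptions that are not from the review: read errors at a position are independent, the consensus base is decided by strict majority vote, and all erroneous reads agree on the same wrong base (a pessimistic case). The function name `consensus_error` and the 1% per-read error rate are illustrative only.

```python
from math import comb


def consensus_error(per_read_error: float, coverage: int) -> float:
    """Probability that a strict majority of reads is wrong at one position.

    Assumes independent errors and that every erroneous read shows the
    same wrong base. Ties (possible for even coverage) are not counted
    as errors, so odd coverage gives the cleanest interpretation.
    """
    p = per_read_error
    n = coverage
    first_majority = n // 2 + 1  # smallest number of wrong reads that wins the vote
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(first_majority, n + 1)
    )


# Illustrative per-read error rate of 1% at odd coverages within the
# two- to tenfold range mentioned in the text.
for n in (1, 3, 5, 9):
    print(f"coverage {n}: consensus error ~ {consensus_error(0.01, n):.2e}")
```

Under these assumptions a single read is wrong at 1 position in 100, whereas threefold coverage already reduces the consensus error to roughly 3 in 10,000. Real error rates are position-dependent and correlated, so the numbers indicate the trend only.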

Annotating genomes

Function prediction usually starts with already assembled genomic or cDNA data: at best a complete genome (Figure 2). Several features intrinsic to DNA can be recognized first, before identification of genes and pathways, although detection of the latter also improves the annotation of non-coding features in genomes.

Annotating individual proteins

Although homology searches are often already integrated into the gene prediction procedures, they are fully exploited only at the protein level with its higher sensitivity. Database searches are a standard technique for annotating proteins, but should be used in context with other methods (Bork & Koonin, 1998).

Incorporating proteomics data

Proteomics focuses on the protein products of the genome and their interactions rather than on DNA sequences (Humphery-Smith & Blackstock, 1997). It is thus complementary to the genomic and nucleic acid information (Kahn, 1995), exploiting novel tools such as 2D large-scale analysis (Vietor & Huber, 1997) and powerful mass-spectrometry applications (Yates, 1998).

Predicting function in higher order processes

Having predicted or determined functions for as many genes as possible and having assigned their interactions as well as their expression levels, it is a challenging task to put all the information into the context of cellular processes (Figure 1). A variety of databases and tools are emerging to support this procedure.

Robustness, modularity and interdependence

When considering all the levels discussed, there seems to be a discrepancy between the complex nature of the networks of genes and their interdependence (e.g. via regulation) on the one hand and the surprising robustness (e.g. to horizontal gene transfer or gene loss) on the other. One way in which such robustness might be achieved is a highly modular organisation; the interdependencies of genes would then be limited to small sets. As yet we do not have a quantitative understanding of the

Acknowledgements

The order of the authors is alphabetical. This work was supported by Deutsche Forschungsgemeinschaft (Bo 1099/3-1) and BMBF (grants 01KW9602/6; 0311748; 0311617). We thank Enrique Morrett and Shamil Sunyaev for critical reading of the manuscript and David Thomas for stylistic corrections. Most of all we acknowledge the work and efforts of all our colleagues who could not be mentioned in this review due to limitations of space and time and the limited selection such a review necessarily has to

References (155)

M.A. Andrade et al. Bioinformatics: from genome data to biological knowledge. Curr. Opin. Biotechnol. (1997)
D.R. Bish et al. Enzymatic reaction rate limits with constraints on equilibrium constants and experimental parameters. Biosystems (1998)
H.P.J. Bonarius et al. Flux analysis of underdetermined metabolic networks: the quest for the missing constraints. Trends Biotechnol. (1997)
P. Bork et al. Go hunting in sequence databases but watch out for the traps. Trends Genet. (1996)
P. Bork et al. Applying motif and profile searches. Methods Enzymol. (1996)
P. Bork et al. From genome sequences to protein function. Curr. Opin. Struct. Biol. (1994)
M. Borodovsky et al. New genes in old sequence: a strategy for finding genes in the bacterial genome. Trends Biochem. Sci. (1994)
P.O. Brown. Genome scanning methods. Curr. Opin. Genet. Dev. (1994)
M. Burset et al. Evaluation of gene structure prediction programs. Genomics (1996)
T. Dandekar et al. Conservation of gene order: a fingerprint of physically interacting proteins. Trends Biochem. Sci. (1998)
T. Doerks et al. Protein annotation: detective work for function prediction. Trends Genet. (1998)
A.R. Dongre et al. Emerging tandem-mass-spectroscopy techniques for the rapid identification of proteins. Trends Biotechnol. (1997)
F. Eisenhaber et al. Wanted: subcellular localization of proteins based on sequence. Trends Cell Biol. (1998)
J.D. Esko et al. Influence of core protein sequence on glycosaminoglycan assembly. Curr. Opin. Struct. Biol. (1996)
T. Etzold et al. SRS: information retrieval system for molecular biology data banks. Methods Enzymol. (1996)
T. Gaasterland et al. Fully automated genome analysis that reflects user needs and preferences. A detailed introduction to the MAGPIE system architecture. Biochimie (1996)
L.M. Gelbert et al. Will genetics really revolutionize the drug discovery process? Curr. Opin. Biotechnol. (1997)
R. Guigo. Computational gene identification: an open problem. Comput. Chem. (1997)
M.A. Huynen et al. Genomics: differential genome analysis applied to the species specific features of Helicobacter pylori. FEBS Letters (1998)
M.A. Huynen et al. Homology-based fold predictions for Mycoplasma genitalium proteins. J. Mol. Biol. (1998)
J. Jurka et al. CENSOR: a program for identification and elimination of repetitive elements from DNA sequences. Comput. Chem. (1996)
E.V. Koonin et al. Prokaryotic genomes: the emerging paradigm of genome-based microbiology. Curr. Opin. Genet. Dev. (1997)
B. Küster et al. Identifying proteins and post-translational modifications by mass spectrometry. Curr. Opin. Struct. Biol. (1998)
A. Lupas. Predicting coiled coil regions in proteins. Curr. Opin. Struct. Biol. (1997)
C. Medigue et al. Evidence for horizontal transfer in Escherichia coli speciation. J. Mol. Biol. (1991)
A.J. Mighell et al. Alu sequences. FEBS Letters (1997)
A.R. Mushegian et al. Gene order is not conserved in bacterial evolution. Trends Genet. (1996)
S.F. Altschul et al. Gapped Blast and PSI-Blast, a new generation of protein database search programs. Nucl. Acids Res. (1997)
L. Anderson et al. A comparison of selected mRNA and protein abundances in human liver. Electrophoresis (1997)
M. Andrade et al. Automatic annotation for biological sequences by extraction of keywords from MEDLINE abstracts. Intelligent Systems Mol. Biol. (1997)
M. Andrade et al. Sequence analysis of the Methanococcus jannaschii genome and the prediction of protein function. Comput. Appl. Biosci. (1997)
R.S. Annan et al. The essential role of mass spectroscopy in characterization of protein structure: mapping post-translational modifications. J. Protein Chem. (1997)
I.T. Arkin et al. Are there dominant membrane protein families with a given number of helices? Proteins: Funct. Genet. (1997)
T.K. Attwood et al. The PRINTS protein fingerprint database in its fifth year. Nucl. Acids Res. (1998)
Y. d’Aubenton-Carafa et al. Prediction of rho-independent Escherichia coli transcription terminators. A statistical analysis of their RNA stem-loop structures. J. Mol. Biol. (1990)
S. Audic et al. Self-identification of protein-coding regions in microbial genomes. Proc. Natl Acad. Sci. USA (1998)
A. Bairoch et al. The SWISS-PROT protein sequence databank and its supplement TrEMBL. Nucl. Acids Res. (1998)
A. Bairoch et al. The PROSITE database, its status and progress. Nucl. Acids Res. (1997)
U. Bhatia et al. Dealing with database explosion: a cautionary note. Science (1997)
F.R. Blattner et al. The complete genome sequence of Escherichia coli K-12. Science (1997)
M.S. Boguski et al. Gene discovery in dbEST. Science (1994)
P. Bork et al. Predicting function from protein sequence: where are the bottlenecks? Nature Genet. (1998)
C.J. Bult et al. Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii. Science (1996)
V. Brendel et al. Terminators of transcription with RNA polymerase from Escherichia coli: what they look like and how to find them. J. Biomol. Struct. Dyn. (1986)
R. Brent et al. Understanding gene and allele function with two-hybrid methods. Annu. Rev. Genet. (1997)
L. Chan et al. A computer method for finding common base paired helices in aligned sequences: application to the analysis of random sequences. Nucl. Acids Res. (1990)
E. Chargaff et al. The Nucleic Acids (1955)
P.D. Chastian et al. CTG repeats associated with human genetic disease are inherently flexible. J. Mol. Biol. (1998)
M. Chee et al. Accessing genetic information with high-density DNA arrays. Science (1996)
P. Colas et al. Genetic selection of peptide aptamers that recognize and inhibit cyclin-dependent kinase 2. Nature (1997)


¹: Edited by P. E. Wright

²: Present address: Y. Diaz-Lazcoz, Laboratoire Genome et Informatique; Batiment BUFFON, Universite de Versailles-Saint Quentin, 45, avenue des Etats-Unis, 78035 Versailles Cedex, France.
