Journal of Molecular Biology
Review
Predicting function: from genes to genomes and back¹
Genomes and function prediction
Prediction of protein function using computational tools is becoming increasingly important as the gap widens between the growing number of sequences and the experimental characterization of the respective proteins (Bork & Koonin, 1998; Smith, 1998). With the availability of complete genomes, the prediction process takes on a new quality (Table 1), as context information can be utilized when analysing particular sequences. This review focuses on the added value of genomic information on the many
What is function?
“Function” is a very loosely defined term that only makes sense in context. Most current efforts aim at predicting protein function, but there are other types of function, e.g. RNA function or organelle function, that also need to be explored. Even describing “protein function” requires a broad range of attributes and features (Figure 1). Molecular features such as enzymatic activity, interaction partners, and pathway context are currently being predicted, but only qualitatively. Expression
Functional prediction for gene products by annotation transfer from homologous sequences
When homologues of a query are identified in a database search (Bork & Gibson, 1996), the annotated information of the homologue and the taxonomic, biochemical and/or molecular-biological context of the query protein are used to extrapolate possible structural and functional features of the query protein. This approach has proven extremely successful, although from a formal point of view the hypotheses generated must be experimentally verified (Eisenhaber et al., 1995). The information transfer
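The transfer step described above can be illustrated with a minimal Python sketch. It is not the procedure of any particular annotation pipeline: the `Hit` record, the `transfer_annotation` function, the example identifiers and the E-value cutoff of 1e-10 are all hypothetical choices made for illustration. The sketch keeps only annotated, significant hits, takes the best one, and labels the result as a hypothesis rather than a verified function.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Hit:
    """One database hit for a query (hypothetical record layout)."""
    subject_id: str
    evalue: float
    annotation: Optional[str]  # None if the homologue itself is unannotated


def transfer_annotation(hits: Iterable[Hit], max_evalue: float = 1e-10) -> Optional[str]:
    """Return a hedged annotation copied from the best annotated hit, or None.

    The cutoff is an illustrative assumption, not a recommended value.
    """
    # Keep only significant hits that actually carry an annotation.
    candidates = [h for h in hits if h.evalue <= max_evalue and h.annotation]
    if not candidates:
        return None
    # The lowest E-value is treated as the most reliable source.
    best = min(candidates, key=lambda h: h.evalue)
    # Mark the result as a prediction that still needs experimental support.
    return f"putative {best.annotation} (by similarity to {best.subject_id})"


example_hits = [
    Hit("hypA", 1e-40, None),                       # strong hit, but unannotated
    Hit("kinB", 1e-25, "serine/threonine kinase"),  # strong and annotated
    Hit("weakC", 0.5, "transporter"),               # not significant
]
print(transfer_annotation(example_hits))
```

Wording the output as "putative ... by similarity" keeps the origin of the annotation visible, so that a later reader can tell a transferred hypothesis from an experimentally characterized function.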
Sequence and annotation quality in molecular databases
Function transfer by analogy requires knowledge about the quality of sequence data and functional annotation. Concerns have been raised about an accumulation (Bork & Bairoch, 1996) and even an explosion (Bhatia et al., 1997) of errors in sequence databases.
In genome projects, two- to tenfold sequence coverage is usually sampled. This is critical because automated raw data acquisition (single read) is less than 99% accurate even when using optimized sequencers and software (Ewing et al., 1998). Most
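To make the arithmetic behind coverage concrete, here is a rough Python sketch; it is not a calculation taken from the review. It assumes that each read is wrong at a given base independently with probability 1% (the 99% figure taken as an optimistic bound), that the consensus is a simple majority vote, that all erroneous reads agree with each other, and that ties count as errors. The function name `consensus_error` is hypothetical.

```python
from math import comb


def consensus_error(per_read_error: float, coverage: int) -> float:
    """Probability that a majority-vote consensus base is wrong.

    Assumes independent read errors; a position is counted as wrong when
    at least half of the reads covering it are erroneous (ties included).
    """
    p, q = per_read_error, 1.0 - per_read_error
    total = 0.0
    for k in range(coverage + 1):
        if 2 * k >= coverage:  # erroneous reads are not outvoted
            total += comb(coverage, k) * p**k * q**(coverage - k)
    return total


for n in (1, 2, 3, 5, 10):
    print(f"{n:2d}-fold coverage: consensus error ~ {consensus_error(0.01, n):.3g}")
```

Under these assumptions the error per base falls from 1 in 100 for a single read to about 3 in 10,000 at threefold coverage and to the order of 10^-8 at tenfold coverage, whereas twofold coverage does not help because a disagreement between two reads cannot be resolved by voting. Real read errors are not independent, so these numbers are only indicative.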
Annotating genomes
Function prediction usually starts with already assembled genomic or cDNA data: at best a complete genome (Figure 2). Several features intrinsic to DNA can be recognized first, before identification of genes and pathways, although detection of the latter also enhances the annotation of non-coding features in genomes.
Annotating individual proteins
Although homology searches are often already integrated into the gene prediction procedures, they are fully exploited only at the protein level with its higher sensitivity. Database searches are a standard technique for annotating proteins, but should be used in context with other methods (Bork & Koonin, 1998).
Incorporating proteomics data
Proteomics focuses on the protein products of the genome and their interactions rather than on DNA sequences (Humphery-Smith & Blackstock, 1997). It is thus complementary to the genomic and nucleic acid information (Kahn, 1995), exploiting novel tools such as 2D large-scale analysis (Vietor & Huber, 1997) and powerful mass-spectrometry applications (Yates, 1998).
Predicting function in higher order processes
Having predicted or determined functions for as many genes as possible and having assigned their interactions as well as their expression levels, it is a challenging task to put all the information into the context of cellular processes (Figure 1). A variety of databases and tools are emerging to support this procedure.
Robustness, modularity and interdependence
When considering all the levels discussed, there seems to be a discrepancy between the complex nature of the networks of genes and their interdependence (e.g. via regulation) on the one hand and their surprising robustness (e.g. to horizontal gene transfer or gene loss) on the other. One way in which such robustness might be achieved is a highly modular organisation; the interdependencies of genes would then be limited to small sets. As yet we do not have a quantitative understanding of the
Acknowledgements
The order of the authors is alphabetical. This work was supported by Deutsche Forschungsgemeinschaft (Bo 1099/3-1) and BMBF (grants 01KW9602/6; 0311748; 0311617). We thank Enrique Morrett and Shamil Sunyaev for critical reading of the manuscript and David Thomas for stylistic corrections. Most of all we acknowledge the work and efforts of all our colleagues who could not be mentioned in this review due to limitations of space and time and the limited selection such a review necessarily has to
References (155)
- et al. Bioinformatics: from genome data to biological knowledge. Curr. Opin. Biotechnol. (1997)
- et al. Enzymatic reaction rate limits with constraints on equilibrium constants and experimental parameters. Biosystems (1998)
- et al. Flux analysis of underdetermined metabolic networks: the quest for the missing constraints. Trends Biotechnol. (1997)
- et al. Go hunting in sequence databases but watch out for the traps. Trends Genet. (1996)
- et al. Applying motif and profile searches. Methods Enzymol. (1996)
- et al. From genome sequences to protein function. Curr. Opin. Struct. Biol. (1994)
- et al. New genes in old sequence: a strategy for finding genes in the bacterial genome. Trends Biochem. Sci. (1994)
- Genome scanning methods. Curr. Opin. Genet. Dev. (1994)
- et al. Evaluation of gene structure prediction programs. Genomics (1996)
- et al. Conservation of gene order: a fingerprint of physically interacting proteins. Trends Biochem. Sci. (1998)
- Protein annotation: detective work for function prediction. Trends Genet.
- Emerging tandem-mass-spectrometry techniques for the rapid identification of proteins. Trends Biotechnol.
- Wanted: subcellular localization of proteins based on sequence. Trends Cell Biol.
- Influence of core protein sequence on glycosaminoglycan assembly. Curr. Opin. Struct. Biol.
- SRS: information retrieval system for molecular biology data banks. Methods Enzymol.
- Fully automated genome analysis that reflects user needs and preferences. A detailed introduction to the MAGPIE system architecture. Biochimie
- Will genetics really revolutionize the drug discovery process? Curr. Opin. Biotechnol.
- Computational gene identification: an open problem. Comput. Chem.
- Genomics: differential genome analysis applied to the species specific features of Helicobacter pylori. FEBS Letters
- Homology-based fold predictions for Mycoplasma genitalium proteins. J. Mol. Biol.
- CENSOR: a program for identification and elimination of repetitive elements from DNA sequences. Comput. Chem.
- Prokaryotic genomes: the emerging paradigm of genome-based microbiology. Curr. Opin. Genet. Dev.
- Identifying proteins and post-translational modifications by mass spectrometry. Curr. Opin. Struct. Biol.
- Predicting coiled coil regions in proteins. Curr. Opin. Struct. Biol.
- Evidence for horizontal transfer in Escherichia coli speciation. J. Mol. Biol.
- Alu sequences. FEBS Letters
- Gene order is not conserved in bacterial evolution. Trends Genet.
- Gapped Blast and PSI-Blast, a new generation of protein database search programs. Nucl. Acids Res.
- A comparison of selected mRNA and protein abundances in human liver. Electrophoresis
- Automatic annotation for biological sequences by extraction of keywords from MEDLINE abstracts. Intelligent Systems Mol. Biol.
- Sequence analysis of the Methanococcus jannaschii genome and the prediction of protein function. Comput. Appl. Biosci.
- The essential role of mass spectroscopy in characterization of protein structure: mapping post-translational modifications. J. Protein Chem.
- Are there dominant membrane protein families with a given number of helices? Proteins: Funct. Genet.
- The PRINTS protein fingerprint database in its fifth year. Nucl. Acids Res.
- Prediction of rho-independent Escherichia coli transcription terminators. A statistical analysis of their RNA stem-loop structures. J. Mol. Biol.
- Self-identification of protein-coding regions in microbial genomes. Proc. Natl Acad. Sci. USA
- The SWISS-PROT protein sequence databank and its supplement TrEMBL. Nucl. Acids Res.
- The PROSITE database, its status and progress. Nucl. Acids Res.
- Dealing with database explosion: a cautionary note. Science
- The complete genome sequence of Escherichia coli K-12. Science
- Gene discovery in dbEST. Science
- Predicting function from protein sequence: where are the bottlenecks? Nature Genet.
- Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii. Science
- Terminators of transcription with RNA polymerase from Escherichia coli: what they look like and how to find them. J. Biomol. Struct. Dyn.
- Understanding gene and allele function with two-hybrid methods. Annu. Rev. Genet.
- A computer method for finding common base paired helices in aligned sequences: application to the analysis of random sequences. Nucl. Acids Res.
- The Nucleic Acids
- CTG repeats associated with human genetic disease are inherently flexible. J. Mol. Biol.
- Accessing genetic information with high-density DNA arrays. Science
- Genetic selection of peptide aptamers that recognize and inhibit cyclin-dependent kinase 2. Nature
1. Edited by P. E. Wright
2. Present address: Y. Diaz-Lazcoz, Laboratoire Genome et Informatique; Batiment BUFFON, Universite de Versailles-Saint Quentin, 45, avenue des Etats-Unis, 78035 Versailles Cedex, France.