Elsevier

Biosystems

Volume 47, Issues 1–2, June–July 1998, Pages 119-128
Computation with the KEGG pathway database

https://doi.org/10.1016/S0303-2647(98)00017-3

Abstract

We introduce and discuss a new computational approach towards prediction and inference of biological functions from genomic sequences by making use of the pathway data in KEGG. Due to its piecewise nature, the current approach of predicting each gene function based on sequence similarity searches often fails to reconstruct cellular functions with all necessary components. The pathway diagram in KEGG, which may be considered a wiring diagram of molecules in biological systems, can be utilised as a reference for functional reconstruction. KEGG also contains binary relations that represent molecular interactions and relations and that can be utilised for computing and comparing pathways.

Introduction

Living organisms behave as complex systems that are flexible and adaptive to their surroundings. At the molecular level, organisms consist of intricate networks of molecular reactions, which are often called biochemical pathways. Scientists in the fields of biochemistry, molecular genetics, and molecular biology have made great efforts to uncover various types of biochemical pathways and to abstract common features in different pathways and in different organisms. The best characterised among them are the metabolic pathways that involve enzymatic reactions of chemical compounds. While computational biologists have long been tackling the problem of modelling, simulating, and predicting the metabolic pathways (Franco and Canela, 1984, Brutlag et al., 1991, Cohen and Bergman, 1994, Hofestädt and Meineke, 1995, Okamoto et al., 1996), the recent explosion of genomic sequence data poses new problems and raises new interests, which are the subject of this manuscript.

At the time of this writing, the complete genomic sequences of nine species have been made publicly available. Genomic sequencing has become essential to elucidating and understanding complicated molecular systems in a cell. Although systematic biochemical and genetic experiments will be necessary in the following phase of functional genomics, computational approaches are also expected to play a large and complementary part in the functional characterisation. During the last decade, the sequence databases, combined with the progress in sequence similarity search methods, have proved extremely useful for functional prediction of a single gene or a single molecule. However, in order to characterise the cellular function that results from a network of interacting molecules, a new ‘pathway’ database must be developed for comparison and computation at a higher level of the biological system. Basically, there are two major problems in representing and computing biochemical pathways.

First, the vast amount of knowledge on molecular pathways that has been accumulated for different cells and different organisms is underrepresented in databases and dispersed across the literature. Even in limited attempts to computerise such knowledge, the pathway data are usually entered in databases primarily for browsing purposes. In contrast, we consider that the pathway knowledge should be computerised for the purposes of computation. It is expected that there will be complete genomes of hundreds of species from the three domains of life, Archaea, Bacteria, and Eucarya (Woese et al., 1990, Koonin, 1997). The in silico reconstruction of metabolic pathways is already an essential tool for functional assignment of predicted genes, because almost no data from biochemical experiments exist (Mushegian and Koonin, 1996, Danchin, 1997).

Second, new bioinformatics technologies need to be developed to assist functional prediction from sequence data, especially by incorporating systems views on molecular pathways and molecular assemblies. It should be emphasised that the complete genomes of yeast and several bacteria still contain one third to over one half of ORFs that are left uncharacterised because no significant hits are found with well characterised sequences in the existing databases. The threshold of significant hits is somewhat arbitrary and it is often determined with consideration of the expected ratio of true predictions to false predictions (Brenner et al., 1995, Hubbard, 1996). However, a more desirable bioinformatics approach should mimic and automate the biologists’ reasoning steps to effectively decrease the threshold when additional biological information is found to be associated with the sequence similarity.

KEGG (Kyoto Encyclopedia of Genes and Genomes) is our new bioinformatics project initiated in 1995 at Kyoto University (Kanehisa, 1997a). KEGG aims at:

  • organising and computerising all current knowledge of molecular and genetic pathways from experimental observations,

  • maintaining the gene catalogue of every organism that has been sequenced and mapping each gene product onto a component in the pathway, and

  • developing new bioinformatics technologies for comparing and computing pathways.

The thrust of the project is to describe, utilise, predict, and possibly design systems behaviours of living organisms. Here we report the computational methods that have been developed to efficiently utilise the metabolic and genomic data in KEGG.

Pathway database

The KEGG pathway database consists of two sections: the metabolic pathway section and the regulatory pathway section. While the current knowledge of metabolic pathways is well organised in KEGG, the organisation of regulatory pathways is still rudimentary. The metabolic pathway data were first entered from the book ‘Metabolic Maps’ compiled by the Japanese Biochemical Society (Nishizuka, 1980, Nishizuka, 1997) and the wall chart of ‘Biochemical Pathways’ by Boehringer-Mannheim (Gerhard, 1992).

Reconstructing organism-specific pathways

An organism-specific pathway is automatically generated by matching the reference pathway diagram and the gene catalogue according to the EC number. When the gene for an enzyme exists in the gene catalogue, the box representing the corresponding enzyme is marked by colour on the pathway. The consecutive appearance of the coloured boxes would then be considered an organism-specific pathway (Fig. 2). In order for this procedure to be successful, the reference diagram should contain all known
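The matching step described in this section can be illustrated with a minimal Python sketch. It is not the KEGG implementation. It assumes, for simplicity, that a reference pathway is a linear, ordered list of EC numbers (real pathway diagrams are graphs) and that the gene catalogue is a mapping from gene identifiers to the EC numbers assigned to them. The function names, the gene identifiers, and the `min_length` parameter are all hypothetical.

```python
from typing import Dict, List, Set


def colour_pathway(reference_pathway: List[str],
                   gene_catalogue: Dict[str, Set[str]]) -> List[bool]:
    """Mark each enzyme box as coloured if the catalogue has a gene with that EC number."""
    # Collect every EC number assigned to at least one gene in the organism.
    ec_in_genome: Set[str] = set().union(*gene_catalogue.values())
    return [ec in ec_in_genome for ec in reference_pathway]


def consecutive_runs(reference_pathway: List[str],
                     coloured: List[bool],
                     min_length: int = 2) -> List[List[str]]:
    """Return stretches of consecutive coloured boxes (assumed linear layout)."""
    runs: List[List[str]] = []
    current: List[str] = []
    for ec, is_coloured in zip(reference_pathway, coloured):
        if is_coloured:
            current.append(ec)
        else:
            # An uncoloured box ends the current stretch.
            if len(current) >= min_length:
                runs.append(current)
            current = []
    if len(current) >= min_length:
        runs.append(current)
    return runs


# Illustrative data: the first five enzymes of glycolysis as the reference,
# and a made-up gene catalogue with hypothetical gene identifiers.
reference = ["2.7.1.1", "5.3.1.9", "2.7.1.11", "4.1.2.13", "5.3.1.1"]
catalogue = {
    "geneA": {"2.7.1.1"},
    "geneB": {"5.3.1.9"},
    "geneC": {"4.1.2.13"},
}

coloured = colour_pathway(reference, catalogue)
runs = consecutive_runs(reference, coloured)
print(coloured)
print(runs)
```

In this sketch an uncoloured box simply breaks the stretch. Such a gap may reflect either a genuinely absent enzyme or a gene whose function has not yet been assigned.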

Discussion

In this paper, we have described a computational approach that is intended to assist in silico investigation of gene functions and molecular pathways. This has become feasible because of the unique pathway database being organised by KEGG. A biochemical pathway is an abstraction of a subset of intricate networks in the soup of interacting biomolecules (Mavrovouniotis, 1995), and the abstraction is arbitrarily chosen by biologists at the level of their interests. KEGG’s pathway is at a highest

Acknowledgements

This work was supported in part by a Grant-in-Aid for Scientific Research on Priority Areas, ‘Genome Science’, from the Ministry of Education, Science, Sports and Culture of Japan. The computation time was provided by the Supercomputer Laboratory, Institute for Chemical Research, Kyoto University.

References (31)

  • R. Franco et al. Computer simulation of purine metabolism. Eur. J. Biochem. (1984)
  • W. Fujibuchi et al. DBGET/LinkDB: an integrated database retrieval system. Pacific Symp. Biocomputing (1997)
  • T. Gaasterland et al. An overview of co-operative answering. J. Intell. Inf. Syst. (1992)
  • Gerhard, M. (Ed.), 1992. Biological Pathways, 3rd ed., Boehringer Mannheim,...
  • Goto, S., Bono, H., Ogata, H., Fujibuchi, W., Nishioka, T., Kanehisa, M., 1996. Organising and computing metabolic...